import numpy as np
# importing the necessary package for performing advanced mathematical operations
import pandas as pd
# importing the necessary package for managing data
import seaborn as sns
import matplotlib.pyplot as plt
# importing the necessary packages for visualisation
sns.set(color_codes = True) # adds a nice background to the graphs
%matplotlib inline
# command to tell Jupyter to display the graphs inline
sns.set_style(style = 'darkgrid')
# pre-processing method
from sklearn.model_selection import train_test_split
# the Classification models
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import xgboost
from mlxtend.classifier import StackingClassifier
# methods and classes for evaluation
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn import metrics
from sklearn.model_selection import cross_validate
df = pd.read_csv('Data - Parkinsons') # Importing the dataset
df.head(7).T # Checking for the top 7 rows to have an idea about the dataset
df.tail(7).T # Checking for the bottom 7 rows to have an idea about the dataset
# Here we check the missing values, unique values, duplicate values, and a few sample rows of the dataset
def indetailtable(df): # Takes the dataframe and returns a summary table with one row per column
    print(f'Dataset Shape : {df.shape}') # shape of the dataset
    print('Total number of rows in dataset = {}'.format(df.shape[0])) # number of rows
    print('Total number of columns in dataset = {}'.format(df.shape[1])) # number of columns
    print('Datatypes present in the dataset: {}\n'.format(df.dtypes.value_counts())) # datatype counts
    # Build a summary dataframe that gives an overview of every column of the dataset
    summary = pd.DataFrame(df.dtypes, columns = ['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index'] # column/feature names
    summary = summary[['Name', 'dtypes']] # datatypes of the various columns/features
    summary['Missing_values'] = df.isnull().sum().values # missing values per column, if any
    summary['Unique_values'] = df.nunique().values # unique values per column
    summary['Duplicate_values'] = df.duplicated().sum() # duplicate rows in the dataset, if any
    summary['1st Value'] = df.loc[0].values # first row of the dataset
    summary['2nd Value'] = df.loc[1].values # second row of the dataset
    summary['194th Value'] = df.loc[193].values # 194th row of the dataset
    summary['195th Value'] = df.loc[194].values # 195th (last) row of the dataset
    return summary # the summary dataframe built inside the function
# Calling the function defined above on the current dataset
brief = indetailtable(df)
brief # displaying the summary table
There are 195 data points / observations and 24 columns/features in the dataset.
Of these, 22 columns are of float datatype, one is integer, and one is object. The 'name' column is of object datatype and contains subject names and recording numbers in ASCII format.
There are no missing values in the dataset, so it can be treated as clean in that respect.
Similarly, there are no duplicate rows, so the dataset is also clean in terms of duplicates.
Except for the status column, which is our target, all other columns contain many unique values.
The status column, of integer datatype, has two unique values, 0 and 1, where 0 signifies a healthy subject and 1 a patient affected by Parkinson's disease.
df.describe().T # Looking for the statistical Inferences
The table above gives the five-point summary of each feature: minimum, 25th, 50th, and 75th percentiles, and maximum.
• From this statistical distribution analysis, it is noticeable that most features have their own range of values and their own distribution.
• The columns use different units and scales. Some columns measure frequency in Hertz (Hz), so their values and distributions follow that scale.
• Some columns are amplitude perturbation measures of the sound waves derived from the patients' voice recordings. The scale of these amplitude features differs from that of the frequency features, and the unit is what distinguishes the two groups.
• Similarly, NHR and HNR are both ratios, so they are unitless (or differently scaled), which sets them apart from the other features.
• The same reasoning applies to the remaining features.
• Some features take values mostly above 100 while others stay below 1, so the feature columns are on very uneven scales, as the statistical description above shows.
• The status column is binomial in nature, with the two categories 0 and 1.
• The column spread1, a nonlinear measure of fundamental frequency variation, is the only column with exclusively negative values: it ranges from a minimum of -7.964 to a maximum of -2.434.
Further checks on individual columns
Some feature names contain characters (colons, parentheses, '%') that can cause errors in later operations on the dataset. To avoid such errors at a later stage of the analysis, we rename the columns here.
# Renaming the columns
df = df.rename(columns = {'MDVP:Fo(Hz)':'MDVP_Fo_Hz', 'MDVP:Fhi(Hz)': 'MDVP_Fhi_Hz', 'MDVP:Flo(Hz)':'MDVP_Flo_Hz',
'MDVP:Jitter(%)':'MDVP_Jitter_per', 'MDVP:Jitter(Abs)':'MDVP_Jitter_Abs','MDVP:RAP':'MDVP_RAP',
'MDVP:PPQ':'MDVP_PPQ','Jitter:DDP':'Jitter_DDP','MDVP:Shimmer':'MDVP_Shimmer', 'MDVP:Shimmer(dB)':'MDVP_Shimmer_dB',
'Shimmer:APQ3':'Shimmer_APQ3','Shimmer:APQ5':'Shimmer_APQ5','MDVP:APQ':'MDVP_APQ','Shimmer:DDA':'Shimmer_DDA'})
df.columns
Using the code below, we test whether the dataframe contains placeholder values such as "?" in any column.
for value in ['MDVP_Fo_Hz', 'MDVP_Fhi_Hz', 'MDVP_Flo_Hz', 'MDVP_Jitter_per',
'MDVP_Jitter_Abs', 'MDVP_RAP', 'MDVP_PPQ', 'Jitter_DDP', 'MDVP_Shimmer',
'MDVP_Shimmer_dB', 'Shimmer_APQ3', 'Shimmer_APQ5', 'MDVP_APQ',
'Shimmer_DDA', 'NHR', 'HNR', 'status', 'RPDE', 'DFA', 'spread1',
'spread2', 'D2', 'PPE']:
print(value,":", sum(df[value] == '?'))
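The per-column loop above can also be written as a single vectorized comparison; a minimal sketch on a small hypothetical frame (on the real data the call would simply be `(df == '?').sum()`):

```python
import pandas as pd

# Hypothetical small frame standing in for df; column 'a' hides one '?' placeholder
demo = pd.DataFrame({'a': [1, '?', 3], 'b': [4, 5, 6]})

# Count '?' placeholders in every column in one vectorized comparison
question_marks = (demo == '?').sum()
print(question_marks)
```

This avoids typing out the column list by hand and automatically covers any columns added later.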
# Plotting the distribution of continous features individually
plt.figure(figsize = (20,15))
plt.subplot(3,3,1)
plt.hist(df.MDVP_Fo_Hz, color = 'green', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Average Vocal Fundamental Frequency')
plt.subplot(3,3,2)
plt.hist(df.MDVP_Fhi_Hz, color = 'red', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Maximum Vocal Fundamental Frequency')
plt.subplot(3,3,3)
plt.hist(df.MDVP_Flo_Hz, color = 'blue', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Minimum Vocal Fundamental Frequency');
plt.figure(figsize = (20,15))
plt.subplot(3,3,1)
sns.violinplot(df['MDVP_Fo_Hz'], color = 'green')
plt.xlabel('Average Vocal Fundamental Frequency')
plt.subplot(3,3,2)
sns.violinplot(df['MDVP_Fhi_Hz'], color = 'red')
plt.xlabel('Maximum Vocal Fundamental Frequency')
plt.subplot(3,3,3)
sns.violinplot(df['MDVP_Flo_Hz'], color = 'blue')
plt.xlabel('Minimum Vocal Fundamental Frequency')
fund_frq_col = ['MDVP_Fo_Hz','MDVP_Fhi_Hz', 'MDVP_Flo_Hz']
df[fund_frq_col].plot.hist(alpha=0.5,bins=50);
The graph above gives a comparative view of the ranges of the average, maximum, and minimum vocal fundamental frequencies.
Note that the maximum vocal fundamental frequency spans a much wider range than the other two.
plt.figure(figsize = (20,15))
plt.subplot(3,3,1)
sns.scatterplot(df.MDVP_Fo_Hz, df.MDVP_Fhi_Hz, hue = df.status, palette = ['red', 'blue'])
plt.subplot(3,3,2)
sns.scatterplot(df.MDVP_Fo_Hz, df.MDVP_Flo_Hz, hue = df.status, palette = ['purple', 'orange'])
plt.subplot(3,3,3)
sns.scatterplot(df.MDVP_Flo_Hz, df.MDVP_Fhi_Hz, hue = df.status, palette = ['red', 'green']);
MDVP_Jitter_per: MDVP jitter in percentage
MDVP_Jitter_Abs: MDVP jitter in absolute value
MDVP_RAP: MDVP relative amplitude perturbation
MDVP_PPQ: MDVP period perturbation quotient
Jitter_DDP: Average absolute difference of differences between jitter cycles
# Plotting the distribution of continous features individually
plt.figure(figsize = (20,15))
plt.subplot(3,3,1)
plt.hist(df.MDVP_Jitter_per, color = 'green', edgecolor = 'black', alpha = 0.7);
plt.xlabel('MDVP Jitter in Percentage')
plt.subplot(3,3,2)
plt.hist(df.MDVP_Jitter_Abs, color = 'red', edgecolor = 'black', alpha = 0.7);
plt.xlabel('MDVP Jitter in Absolute value')
plt.subplot(3,3,3)
plt.hist(df.MDVP_RAP, color = 'blue', edgecolor = 'black', alpha = 0.7);
plt.xlabel('MDVP relative amplitude perturbation')
plt.subplot(3,3,4)
plt.hist(df.MDVP_PPQ, color = 'orange', edgecolor = 'black', alpha = 0.7);
plt.xlabel('MDVP period perturbation quotient')
plt.subplot(3,3,5)
plt.hist(df.Jitter_DDP, color = 'purple', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Average absolute difference of differences between jitter cycles');
plt.figure(figsize = (20,15))
plt.subplot(3,3,1)
sns.violinplot(df['MDVP_Jitter_per'], color = 'green')
plt.xlabel('MDVP Jitter in Percentage')
plt.subplot(3,3,2)
sns.violinplot(df['MDVP_Jitter_Abs'], color = 'red')
plt.xlabel('MDVP Jitter in Absolute value')
plt.subplot(3,3,3)
sns.violinplot(df['MDVP_RAP'], color = 'blue')
plt.xlabel('MDVP relative amplitude perturbation')
plt.subplot(3,3,4)
sns.violinplot(df['MDVP_PPQ'], color = 'orange')
plt.xlabel('MDVP period perturbation quotient')
plt.subplot(3,3,5)
sns.violinplot(df['Jitter_DDP'], color = 'purple')
plt.xlabel('Average absolute difference of differences between jitter cycles') ;
fund_freq_meas = ['MDVP_Jitter_per','MDVP_Jitter_Abs','MDVP_RAP','MDVP_PPQ','Jitter_DDP' ]
df[fund_freq_meas].plot.hist(alpha=0.4, bins = 30);
plt.figure(figsize = (20,15))
plt.subplot(3,3,1)
sns.scatterplot(df.MDVP_Jitter_per, df.MDVP_Jitter_Abs, hue = df.status, palette = ['red', 'blue'])
plt.subplot(3,3,2)
sns.scatterplot(df.MDVP_Jitter_Abs, df.MDVP_RAP, hue = df.status, palette = ['purple', 'orange'])
plt.subplot(3,3,3)
sns.scatterplot(df.MDVP_RAP, df.MDVP_Jitter_per, hue = df.status, palette = ['red', 'green']);
plt.subplot(3,3,4)
sns.scatterplot(df.MDVP_RAP, df.Jitter_DDP, hue = df.status, palette = ['orange', 'green']);
plt.subplot(3,3,5)
sns.scatterplot(df.MDVP_PPQ, df.Jitter_DDP, hue = df.status, palette = ['blue', 'orange']);
plt.subplot(3,3,6)
sns.scatterplot(df.MDVP_Jitter_per, df.Jitter_DDP, hue = df.status, palette = ['purple', 'orange']);
The plots above show the correlation between several pairs of columns. Features like MDVP_RAP, Jitter_DDP, MDVP_Jitter_Abs, etc. are strongly correlated with one another: they directly affect each other, and one can stand in for another.
The relationship between MDVP_Jitter_Abs and MDVP_RAP is the exception: MDVP_Jitter_Abs shows little to no effect on MDVP_RAP.
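The strong correlation among the jitter measures can also be checked numerically with `DataFrame.corr()`. A minimal sketch on synthetic stand-in data; it relies on the fact that, in the standard MDVP definitions, Jitter:DDP is three times RAP, so the two are perfectly correlated:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the jitter columns: Jitter_DDP = 3 * MDVP_RAP by definition,
# while MDVP_Jitter_Abs is drawn independently
rng = np.random.default_rng(0)
rap = rng.random(100)
demo = pd.DataFrame({
    'MDVP_RAP': rap,
    'Jitter_DDP': 3 * rap,
    'MDVP_Jitter_Abs': rng.random(100),
})

corr = demo.corr()
print(corr.round(3))
```

On the real dataframe, `df[['MDVP_RAP', 'Jitter_DDP', 'MDVP_Jitter_Abs']].corr()` gives the corresponding numbers for the plots above.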
MDVP_Shimmer: MDVP local shimmer
MDVP_Shimmer_dB: MDVP local shimmer in dB
Shimmer_APQ3: Three-point amplitude perturbation quotient
Shimmer_APQ5: Five-point amplitude perturbation quotient
MDVP_APQ: MDVP amplitude perturbation quotient
Shimmer_DDA: Average absolute differences between the amplitudes of consecutive periods
# Plotting the distribution of continous features individually
plt.figure(figsize = (20,15))
plt.subplot(3,3,1)
plt.hist(df.MDVP_Shimmer, color = 'green', edgecolor = 'black', alpha = 0.7);
plt.xlabel('MDVP local shimmer')
plt.subplot(3,3,2)
plt.hist(df.MDVP_Shimmer_dB, color = 'red', edgecolor = 'black', alpha = 0.7);
# sns.distplot(df['MDVP_Fhi_Hz'], color = 'red')
plt.xlabel('MDVP local shimmer in dB')
plt.subplot(3,3,3)
plt.hist(df.Shimmer_APQ3, color = 'blue', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Three-point amplitude perturbation quotient')
plt.subplot(3,3,4)
plt.hist(df.Shimmer_APQ5, color = 'orange', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Five-point amplitude perturbation quotient')
plt.subplot(3,3,5)
plt.hist(df.MDVP_APQ, color = 'purple', edgecolor = 'black', alpha = 0.7);
plt.xlabel('MDVP amplitude perturbation quotient')
plt.subplot(3,3,6)
plt.hist(df.Shimmer_DDA, color = 'violet', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Average absolute differences between the amplitudes of consecutive periods') ;
plt.figure(figsize = (20,15))
plt.subplot(3,3,1)
sns.violinplot(df['MDVP_Shimmer'], color = 'green');
plt.xlabel('MDVP local shimmer')
plt.subplot(3,3,2)
sns.violinplot(df['MDVP_Shimmer_dB'], color = 'red');
plt.xlabel('MDVP local shimmer in dB')
plt.subplot(3,3,3)
sns.violinplot(df['Shimmer_APQ3'], color = 'blue');
plt.xlabel('Three-point amplitude perturbation quotient')
plt.subplot(3,3,4)
sns.violinplot(df['Shimmer_APQ5'], color = 'orange');
plt.xlabel('Five-point amplitude perturbation quotient')
plt.subplot(3,3,5)
sns.violinplot(df['MDVP_APQ'], color = 'purple');
plt.xlabel('MDVP amplitude perturbation quotient')
plt.subplot(3,3,6)
sns.violinplot(df['Shimmer_DDA'], color = 'violet')
plt.xlabel('Average absolute differences between the amplitudes of consecutive periods')
NHR: Noise-to-harmonics ratio
HNR: Harmonics-to-noise ratio
RPDE: Recurrence period density entropy measure
DFA: Signal fractal scaling exponent of detrended fluctuation analysis
spread1, spread2: Two nonlinear measures of fundamental frequency variation
D2: Correlation dimension
PPE: Pitch period entropy
# Plotting the distribution of continous features individually
plt.figure(figsize = (20,15))
plt.subplot(3,3,1)
plt.hist(df.NHR, color = 'green', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Noise-to-harmonics ratio')
plt.subplot(3,3,2)
plt.hist(df.HNR, color = 'red', edgecolor = 'black', alpha = 0.7);
# sns.distplot(df['MDVP_Fhi_Hz'], color = 'red')
plt.xlabel('Harmonics-to-noise ratio')
plt.subplot(3,3,3)
plt.hist(df.DFA, color = 'blue', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Signal fractal scaling exponent of detrended fluctuation analysis')
plt.subplot(3,3,4)
plt.hist(df.spread1, color = 'orange', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Nonlinear measure of fundamental frequency variation (spread1)')
plt.subplot(3,3,5)
plt.hist(df.spread2, color = 'purple', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Nonlinear measure of fundamental frequency variation (spread2)')
plt.subplot(3,3,6)
plt.hist(df.D2, color = 'violet', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Correlation dimension')
plt.subplot(3,3,7)
plt.hist(df.PPE, color = 'teal', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Pitch period entropy')
plt.subplot(3,3,8)
plt.hist(df.RPDE, color = 'turquoise', edgecolor = 'black', alpha = 0.7);
plt.xlabel('NonLinear Dynamic Complexity Measures-2');
# Plotting the distribution of continous features individually
plt.figure(figsize = (20,15))
plt.subplot(3,3,1)
# plt.hist(df.NHR, color = 'green', edgecolor = 'black', alpha = 0.7);
sns.violinplot(df['NHR'], color = 'green')
plt.xlabel('Noise-to-harmonics ratio')
plt.subplot(3,3,2)
# plt.hist(df.HNR, color = 'red', edgecolor = 'black', alpha = 0.7);
sns.violinplot(df['HNR'], color = 'red')
plt.xlabel('Harmonics-to-noise ratio')
plt.subplot(3,3,3)
# plt.hist(df.DFA, color = 'blue', edgecolor = 'black', alpha = 0.7);
sns.violinplot(df['DFA'], color = 'blue')
plt.xlabel('Signal fractal scaling exponent of detrended fluctuation analysis')
plt.subplot(3,3,4)
# plt.hist(df.spread1, color = 'orange', edgecolor = 'black', alpha = 0.7);
sns.violinplot(df['spread1'], color = 'orange')
plt.xlabel('Nonlinear measures of fundamental Frequency variation')
plt.subplot(3,3,5)
# plt.hist(df.spread2, color = 'purple', edgecolor = 'black', alpha = 0.7);
sns.violinplot(df['spread2'], color = 'purple')
plt.xlabel('Nonlinear measures of fundamental Frequency variation')
plt.subplot(3,3,6)
# plt.hist(df.D2, color = 'violet', edgecolor = 'black', alpha = 0.7);
sns.violinplot(df['D2'], color = 'violet')
plt.xlabel('Correlation dimension')
plt.subplot(3,3,7)
sns.violinplot(df['PPE'], color = 'teal')
plt.xlabel('Pitch period entropy')
plt.subplot(3,3,8)
sns.violinplot(df['RPDE'], color = 'turquoise')
plt.xlabel('NonLinear Dynamic Complexity Measures-2');
# for doing statistical calculation
import scipy
from sklearn import linear_model
import statsmodels.api as sm
from sklearn import metrics
from sklearn import datasets
import scipy.stats as stats
from scipy.stats import skew
# Preparing a pandas dataframe to store the skewness of each column.
Skewness = pd.DataFrame({'Skewness': [stats.skew(df.MDVP_Fo_Hz), stats.skew(df.MDVP_Fhi_Hz), stats.skew(df.MDVP_Flo_Hz),
stats.skew(df.MDVP_Jitter_per), stats.skew(df.MDVP_Jitter_Abs), stats.skew(df.MDVP_RAP),
stats.skew(df.MDVP_PPQ), stats.skew(df.Jitter_DDP), stats.skew(df.MDVP_Shimmer),
stats.skew(df.MDVP_Shimmer_dB), stats.skew(df.Shimmer_APQ3), stats.skew(df.Shimmer_APQ5),
stats.skew(df.MDVP_APQ), stats.skew(df.Shimmer_DDA), stats.skew(df.NHR), stats.skew(df.HNR),
stats.skew(df.RPDE), stats.skew(df.DFA), stats.skew(df.spread1),
stats.skew(df.spread2), stats.skew(df.D2), stats.skew(df.PPE)]},
index = ['MDVP_Fo_Hz', 'MDVP_Fhi_Hz', 'MDVP_Flo_Hz', 'MDVP_Jitter_per','MDVP_Jitter_Abs', 'MDVP_RAP',
'MDVP_PPQ', 'Jitter_DDP', 'MDVP_Shimmer','MDVP_Shimmer_dB', 'Shimmer_APQ3', 'Shimmer_APQ5',
'MDVP_APQ','Shimmer_DDA','NHR', 'HNR','RPDE','DFA','spread1','spread2', 'D2','PPE'])
Skewness
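The same table can be produced in one call with pandas' own `skew()` method, with no manual listing of column names; note that pandas reports the bias-corrected sample skewness, while `scipy.stats.skew` defaults to the biased estimator, so the values differ slightly. A minimal sketch on a toy frame:

```python
import pandas as pd

# Toy stand-in frame; on the real data this would be df.drop('status', axis=1).skew()
demo = pd.DataFrame({'x': [1.0, 2.0, 3.0, 10.0],   # right-skewed
                     'y': [1.0, 2.0, 3.0, 4.0]})   # symmetric

# One skewness value per numeric column, returned as a small table
skew_table = demo.skew().to_frame('Skewness')
print(skew_table)
```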
plt.figure(figsize = (18,9))
plt.subplot(3,2,1)
sns.boxplot(x= df.MDVP_Fo_Hz, color='green')
plt.xlabel('Average Vocal Fundamental Frequency')
plt.subplot(3,2,2)
sns.boxplot(x= df.MDVP_Fhi_Hz, color='red')
plt.xlabel('Maximum Vocal Fundamental Frequency')
plt.subplot(3,2,3)
sns.boxplot(x= df.MDVP_Flo_Hz, color='blue')
plt.xlabel('Minimum Vocal Fundamental Frequency');
plt.figure(figsize = (20,22))
plt.subplot(5,2,1)
sns.boxplot(x= df.MDVP_Jitter_per, palette = 'gist_heat')
plt.xlabel('MDVP Jitter in Percentage')
plt.subplot(5,2,2)
sns.boxplot(x= df.MDVP_Jitter_Abs, palette = 'Pastel2_r')
plt.xlabel('MDVP Jitter in Absolute value')
plt.subplot(5,2,3)
sns.boxplot(x= df.MDVP_RAP, palette = 'viridis')
plt.xlabel('MDVP relative amplitude perturbation')
plt.subplot(5,2,4)
sns.boxplot(x= df.MDVP_PPQ, palette = 'copper')
plt.xlabel('MDVP period perturbation quotient')
plt.subplot(5,2,5)
sns.boxplot(x= df.Jitter_DDP, palette = 'YlOrBr')
plt.xlabel('Average absolute difference of differences between jitter cycles');
plt.figure(figsize = (20,25))
plt.subplot(6,2,1)
sns.boxplot(x= df.MDVP_Shimmer,color = 'green');
plt.xlabel('MDVP local shimmer')
plt.subplot(6,2,2)
sns.boxplot(x= df.MDVP_Shimmer_dB, color = 'red');
plt.xlabel('MDVP local shimmer in dB')
plt.subplot(6,2,3)
sns.boxplot(x= df.Shimmer_APQ3, color = 'blue');
plt.xlabel('Three-point amplitude perturbation quotient')
plt.subplot(6,2,4)
sns.boxplot(x= df.Shimmer_APQ5, color = 'orange');
plt.xlabel('Five-point amplitude perturbation quotient')
plt.subplot(6,2,5)
sns.boxplot(x= df.MDVP_APQ, color = 'purple');
plt.xlabel('MDVP amplitude perturbation quotient')
plt.subplot(6,2,6)
sns.boxplot(x= df.Shimmer_DDA, color = 'violet')
plt.xlabel('Average absolute differences between the amplitudes of consecutive periods');
plt.figure(figsize = (20,25))
plt.subplot(8,2,1)
sns.boxplot(x= df.NHR, palette = 'twilight')
plt.xlabel('Noise-to-harmonics ratio')
plt.subplot(8,2,2)
sns.boxplot(x= df.HNR, palette = 'inferno')
plt.xlabel('Harmonics-to-noise ratio')
plt.subplot(8,2,3)
sns.boxplot(x= df.DFA, palette = 'pink')
plt.xlabel('Signal fractal scaling exponent')
plt.subplot(8,2,4)
sns.boxplot(x= df.spread1, palette = 'plasma')
plt.xlabel('Nonlinear measures of fundamental Frequency variation')
plt.subplot(8,2,5)
sns.boxplot(x= df.spread2, palette = 'Purples')
plt.xlabel('Nonlinear measures of fundamental Frequency variation')
plt.subplot(8,2,6)
sns.boxplot(x= df.D2, palette = 'autumn')
plt.xlabel('Non Linear dynamicaly complexity measure - 1')
plt.subplot(8,2,7)
sns.boxplot(x= df.PPE, palette = 'spring')
plt.xlabel('Nonlinear measures of fundamental Frequency variation - Pitch period entropy') ;
plt.subplot(8,2,8)
sns.boxplot(x= df.RPDE, palette = 'Greens')
plt.xlabel('Non Linear dynamicaly complexity measure - 2') ;
# Checking the value of counts for Status of PD
df['status'].value_counts()
plt.figure(figsize = (20,15))
plt.subplot(2,2,1)
df['status'].value_counts().plot(kind = 'bar', color = 'purple'); # to plot a bar chart
plt.xlabel('Status of Parkinson Disease');
plt.subplot(2,2,2)
# Data to plot
labels = 'Tested Positive', 'Tested Negative'
sizes = [147, 48]
colors = ['yellowgreen', 'violet']
explode = (0.1, 0) # explode 1st slice
# Plot
plt.pie(sizes,explode=explode, labels=labels, colors=colors, # to plot a pie chart
autopct='%1.1f%%', shadow=True, startangle= - 120)
plt.xlabel('Status of Parkinson Disease');
plt.axis('equal')
plt.show()
n_true = len(df.loc[df['status'] == True])
n_false = len(df.loc[df['status'] == False])
print("Number of Patients affected by Parkinson's Disease: {0} ({1:2.2f}%)".format(n_true, (n_true / (n_true + n_false)) * 100 ))
print("Number of Healthy Patients: {0} ({1:2.2f}%)".format(n_false, (n_false / (n_true + n_false)) * 100))
df.groupby(['status']).count()
75.38% of the patients are affected by PD and 24.62% are not.
The dataset is skewed in its target column because of this uneven split between patients with and without PD.
There are comparatively few records in class 0, i.e. healthy subjects not affected by PD.
The main goal, and the class of interest, is to correctly identify a patient's PD status from their voice recordings.
Above all, we do not want to misclassify patients who are affected by PD as unaffected, i.e. we must minimize false negatives (Type II errors) to achieve the best possible prediction accuracy and, ultimately, to save lives.
sns.pairplot(df, hue="status", palette="husl");
df.corr().T
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (15,12))
plt.title('Pearson Correlation of attributes', y=1, size = 20)
sns.heatmap(df.corr(), linewidth = 0.2, vmax = 1.0,
square = True, cmap = colormap,linecolor = 'red', annot = True);
The correlation matrix between pairs of vocal measures is shown above. A number of vocal measures are highly correlated with one another because several of them capture similar characteristics of the acoustic signals in the dataset. They are so strongly correlated that their individual effects on the target column are essentially the same: these columns are highly interdependent, and a small change in one is mirrored in the other.
The similar vocal measure groups are period perturbations (MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP), amplitude perturbations (MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA), and nonlinear measures (DFA, PPE). In particular, Shimmer:DDA and Shimmer:APQ3 are completely collinear (correlation coefficient equal to 1).
These highly correlated columns/features, with correlation greater than or equal to 0.95, can be dropped as part of the feature engineering and hyperparameter tuning process.
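One common way to find such near-duplicate columns programmatically is to scan the upper triangle of the absolute correlation matrix against the 0.95 threshold suggested above; a sketch on synthetic data (on the real dataframe, replace `demo` with `df.drop('status', axis=1)`):

```python
import numpy as np
import pandas as pd

# Synthetic data: f2 is f1 plus tiny noise (near-collinear), f3 is independent
rng = np.random.default_rng(1)
a = rng.random(50)
demo = pd.DataFrame({'f1': a,
                     'f2': a + rng.normal(0, 1e-3, 50),
                     'f3': rng.random(50)})

# Keep only the strict upper triangle of |corr|; a column whose correlation
# with any earlier column reaches 0.95 carries (almost) duplicated information
corr = demo.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] >= 0.95).any()]
print(to_drop)
```

Using only the upper triangle ensures that of each correlated pair one member is kept rather than both being flagged.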
plt.figure(figsize = (20,15))
plt.subplot(3,3,1)
sns.scatterplot(df.D2, df.spread1, hue = df.status, palette = ['red', 'blue'])
plt.title('NonLinear Dynamical Complexity Vs. Nonlinear Fund. Freq.');
plt.subplot(3,3,2)
sns.scatterplot(df.HNR, df.MDVP_Shimmer_dB, hue = df.status, palette = ['purple', 'orange'])
plt.title('Harmonics to Noise Ratio Vs. Measure in Variation in Amplitude');
plt.subplot(3,3,3)
sns.scatterplot(df.HNR, df.NHR, hue = df.status, palette = ['red', 'green']);
plt.title('Harmonics to Noise Ratio Vs. Noise to Harmonics Ratio');
plt.subplot(3,3,4)
sns.scatterplot(df.HNR, df.MDVP_Fo_Hz, hue = df.status, palette = ['orange', 'green']);
plt.title('Harmonics to Noise Ratio Vs. Average Vocal Fundamental Frequency');
plt.subplot(3,3,5)
sns.scatterplot(df.Shimmer_APQ3, df.Shimmer_DDA, hue = df.status, palette = ['blue', 'orange']);
plt.title('Comparison between two different measures of Amplitude variation');
plt.subplot(3,3,6)
sns.scatterplot(df.spread1, df.MDVP_Fo_Hz, hue = df.status, palette='nipy_spectral');
plt.title('NonLinear measure of Fund. Freq. Vs. Avg. vocal Fund. Freq.');
plt.subplot(3,3,7)
sns.scatterplot(df.RPDE, df.DFA, hue = df.status, palette='Set1');
plt.title('NonLinear dynamical complexity Vs. Signal Fractal Scaling Component');
plt.subplot(3,3,8)
sns.scatterplot(df.MDVP_Jitter_Abs, df.D2, hue = df.status, palette = ['purple', 'green']);
plt.title('Variation in Fund. Freq. Vs. NonLinear dynamical complexity');
plt.subplot(3,3,9)
sns.scatterplot(df.Shimmer_APQ5, df.MDVP_Jitter_Abs, hue = df.status, palette = ['Blue', 'purple']);
plt.title('Variation in Amplitude Vs. Fundamental Frequency');
The scatter plots above were drawn between selected pairs of features to examine the correlation among them in detail.
From these plots it can be inferred that some features are nonlinearly related to each other while others are linearly related; however, the strength of the linear relationship varies from pair to pair.
Some feature pairs fall along a line of roughly zero or infinite slope, i.e. they show effectively no mutual dependence.
plt.figure(figsize = (20,15))
plt.subplot(3,3,1)
sns.scatterplot(df.DFA, df.MDVP_Fhi_Hz, hue = df.status, palette = ['purple', 'orange']);
plt.title('Signal Fractal Scaling Component Vs. Max. Vocal Fund. Freq.');
plt.subplot(3,3,2)
sns.scatterplot(df.DFA, df.MDVP_PPQ, hue = df.status, palette = ['red', 'green']);
plt.title('Signal Fractal Scaling Component Vs. Measures of Variation Fund. Freq.');
plt.subplot(3,3,3)
sns.scatterplot(df.DFA, df.MDVP_APQ, hue = df.status, palette='plasma_r');
plt.title('Signal Fractal Scaling Component Vs. Measures of Variation in Amplitude');
From the scatter plots above, it can be inferred that DFA and MDVP_Fhi_Hz have an arbitrary relationship in which neither feature affects the other; outliers can be observed between these two features.
Similarly, DFA (the signal fractal scaling exponent) has a nonlinear relationship with MDVP_PPQ and MDVP_APQ, which measure variation in fundamental frequency and variation in amplitude respectively.
Of the several measures of fundamental frequency variation, MDVP_RAP can be dropped, since this feature column is highly correlated with the other measures of variation in fundamental frequency.
Of the several measures of amplitude variation, Shimmer_APQ3 can be dropped, since this feature column is highly correlated with the other measures of variation in amplitude.
Of the nonlinear measures, either spread1 or PPE can be dropped, since these two are highly correlated with each other; in addition, both correlate strongly with other independent attributes.
df.columns
df = df.drop(['name'], axis = 1)
df.columns
from sklearn.model_selection import train_test_split
X = df.drop('status', axis = 1) # predictor feature columns or independent columns (22 * m)
y = df['status'] # target / predicted class (1 = Parkinson, 0 = Healthy) (1 * m)
y.head()
X.head()
x_train, x_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 537)
x_train.head()
# checking the split of data
print('{0:0.2f}% data is in training set'.format((len(x_train)/len(df.index))*100))
print('{0:0.2f}% data is in testing set'.format((len(x_test)/len(df.index))*100))
print("Total No. of patients affected by Parkinson's Disease in the given dataset: {0} ({1:0.2f}%)".format(len(df.loc[df['status'] == 1]), (len(df.loc[df['status'] == 1])/len(df.index)) * 100))
print("Total No. of Healthy persons in the given dataset: {0} ({1:0.2f}%)".format(len(df.loc[df['status'] == 0]), (len(df.loc[df['status'] == 0])/len(df.index)) * 100))
print("")
print("Patients affected by Parkinson's Disease present in Training set: {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 1]), (len(y_train[y_train[:] == 1])/len(y_train)) * 100))
print("Patients not affected by Parkinson's Disease present in Training set: {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 0]), (len(y_train[y_train[:] == 0])/len(y_train)) * 100))
print("")
print("Patients affected by Parkinson's Disease present in Testing set: {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 1]), (len(y_test[y_test[:] == 1])/len(y_test)) * 100))
print("Patients not affected by Parkinson's Disease present in Testing set: {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 0]), (len(y_test[y_test[:] == 0])/len(y_test)) * 100))
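Because the classes are imbalanced, passing `stratify=y` to `train_test_split` guarantees that both subsets keep the full dataset's roughly 75/25 class ratio, rather than leaving it to chance; a sketch with synthetic labels mirroring the status column:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the status column: 147 positives, 48 negatives
X_demo = np.arange(195).reshape(-1, 1)
y_demo = np.array([1] * 147 + [0] * 48)

x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size = 0.3, random_state = 537, stratify = y_demo)

# Both subsets keep the full-data positive rate of ~75.4%
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

On the real data this would be `train_test_split(X, y, test_size=0.3, random_state=537, stratify=y)`.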
# Correlation of entire dataframe
corr_matrix = df.corr()
# Features more related to Patient's Status
corr_matrix['status'].sort_values(ascending = False)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
# Fit the model on train
model_lr = LogisticRegression(solver = 'liblinear')
model_lr.fit(x_train, y_train)
# Predict on test
y_predict = model_lr.predict(x_test)
# Checking for the coefficient / slope of all columns and intercept.
coef_df = pd.DataFrame(model_lr.coef_)
coef_df['intercept'] = model_lr.intercept_
print(coef_df)
model_score_train = model_lr.score(x_train, y_train)
print('The accuracy of train model is {0:0.2f}%'.format(model_score_train*100))
model_score_test = model_lr.score(x_test, y_test)
print("The accuracy of Test model is {0:0.2f}% ".format(model_score_test*100))
cm = metrics.confusion_matrix(y_test, y_predict, labels = [1,0])
df_cm = pd.DataFrame(cm, index = [i for i in ['Actual 1', 'Actual 0']],
columns = [i for i in ['Predict 1', 'Predict 0']])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (10,5))
sns.heatmap(df_cm, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red' );
print(metrics.classification_report(y_test, y_predict))
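Since the stated goal is to minimize false negatives, the recall on the positive class is the number to watch alongside accuracy; `recall_score` reports it directly. A sketch with hypothetical label vectors:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical predictions: one PD patient (class 1) is missed
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]

# With labels sorted [0, 1], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('false negatives:', fn)                   # the Type II errors
print('recall:', recall_score(y_true, y_pred))  # tp / (tp + fn)
```

On the real predictions, `recall_score(y_test, y_predict)` gives the fraction of PD patients the model actually catches.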
resultsdf = pd.DataFrame({'Technique w/o scaling': ['Logistic Regression'], 'Accuracy (%)': [model_score_test*100] })
resultsdf = resultsdf[['Technique w/o scaling', 'Accuracy (%)']]
resultsdf
from sklearn.naive_bayes import GaussianNB
model_GNB = GaussianNB()
model_GNB_fit = model_GNB.fit(x_train, y_train)
y_train_pred_GNB = model_GNB.predict(x_train)
# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score
y_test_pred_GNB = model_GNB.predict(x_test)
print('Testing Model Accuracy: {0:0.2f}%'.format(metrics.accuracy_score(y_test, y_test_pred_GNB) * 100))
cmGNB = metrics.confusion_matrix(y_test, y_test_pred_GNB, labels = [1,0])
df_cmGNB = pd.DataFrame(cmGNB, index = ['Actual 1', 'Actual 0'],
                        columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cmGNB, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y_test, y_test_pred_GNB, labels = [1,0]))
Here the f1-score is above 70%, but the false-negative (FN) count is very high.
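Because the confusion matrices in this notebook are built with `labels = [1, 0]`, the false-negative count can be read directly from a fixed cell. A minimal sketch on toy labels (the toy arrays are illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# With labels=[1, 0] the matrix layout is [[TP, FN], [FP, TN]],
# so the false-negative count sits at row 0, column 1.
y_true = np.array([1, 1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1, 1])
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn, fp, tn = cm.ravel()
print(fn)  # number of positive cases the model missed
```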
tempresultsdf = pd.DataFrame({'Technique w/o scaling': ['GNB'], 'Accuracy (%)': [metrics.accuracy_score(y_test, y_test_pred_GNB)*100] })
resultsdf = pd.concat([resultsdf, tempresultsdf])
resultsdf = resultsdf[['Technique w/o scaling', 'Accuracy (%)']]
resultsdf
# Loading Library
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
# Learning the model (k=3)
knn3 = KNeighborsClassifier(n_neighbors = 3)
# Fitting the model
knn_fit3 = knn3.fit(x_train, y_train)
y_train_pred_knn3 = knn_fit3.predict(x_train)
print('Training Model Accuracy for k = 3: {0:0.2f}%'.format(metrics.accuracy_score(y_train, y_train_pred_knn3)*100))
y_test_pred_knn3 = knn_fit3.predict(x_test)
print('Testing Model Accuracy for k = 3: {0:0.2f}%'.format(metrics.accuracy_score(y_test, y_test_pred_knn3)*100))
print()
# Learning the model (k=5)
knn5 = KNeighborsClassifier(n_neighbors = 5)
# Fitting the model
knn_fit5 = knn5.fit(x_train, y_train)
y_train_pred_knn5 = knn_fit5.predict(x_train)
print('Training Model Accuracy for k = 5: {0:0.2f}%'.format(metrics.accuracy_score(y_train, y_train_pred_knn5)*100))
y_test_pred_knn5 = knn_fit5.predict(x_test)
print('Testing Model Accuracy for k = 5: {0:0.2f}%'.format(metrics.accuracy_score(y_test, y_test_pred_knn5)*100))
print()
# Learning the model (k=9)
knn9 = KNeighborsClassifier(n_neighbors = 9)
# Fitting the model
knn_fit9 = knn9.fit(x_train, y_train)
y_train_pred_knn9 = knn_fit9.predict(x_train)
print('Training Model Accuracy for k = 9: {0:0.2f}%'.format(metrics.accuracy_score(y_train, y_train_pred_knn9)*100))
y_test_pred_knn9 = knn_fit9.predict(x_test)
print('Testing Model Accuracy for k = 9: {0:0.2f}%'.format(metrics.accuracy_score(y_test, y_test_pred_knn9)*100))
print()
# Candidate k values for KNN: the odd numbers from 1 to 19
neighbors = list(range(1, 20, 2))
# Creating an empty list to hold accuracy scores
ac_scores = []
# Performing accuracy metrics for values from 1,3,5,7...19
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(x_train, y_train)
    # Predict the response
    y_predKNN = knn.predict(x_test)
    # Evaluate accuracy
    scores = accuracy_score(y_test, y_predKNN)
    ac_scores.append(scores)
# Converting accuracy to misclassification error
MSE = [1-x for x in ac_scores]
# Determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print('The Optimal number of neighbors is %d'% optimal_k)
# Misclassification error Vs. K
plt.plot(neighbors, MSE, color = 'red')
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
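The elbow search above scores each k on a single train/test split, so the chosen k can be sensitive to that one split. A more robust variant (a sketch, not the notebook's method, shown here on synthetic data) averages cross-validated accuracy for each candidate k:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; in the notebook this would be x_train / y_train
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Score each odd k with 5-fold cross-validation instead of one test split
cv_means = {}
for k in range(1, 20, 2):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    cv_means[k] = scores.mean()

best_k = max(cv_means, key=cv_means.get)
print(best_k)
```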
# Using k = 7 for the final model for prediction
knn = KNeighborsClassifier(n_neighbors = 7)
# Fitting the model
knn_fit = knn.fit(x_train, y_train)
y_train_pred_knn = knn_fit.predict(x_train)
print('Training Model Accuracy value for k=7: {0:0.2f}%'.format(metrics.accuracy_score(y_train, y_train_pred_knn)*100))
print('Training Model Recall value for k=7: {0:0.2f}%'.format(metrics.recall_score(y_train, y_train_pred_knn)*100))
print()
y_test_pred_knn = knn_fit.predict(x_test)
print('Testing Model Accuracy value for k=7: {0:0.2f}%'.format(metrics.accuracy_score(y_test, y_test_pred_knn)*100))
print('Testing Model Recall value for k=7: {0:0.2f}%'.format(metrics.recall_score(y_test, y_test_pred_knn)*100))
cmKNN = metrics.confusion_matrix(y_test, y_test_pred_knn, labels = [1,0])
df_cmKNN = pd.DataFrame(cmKNN, index = ['Actual 1', 'Actual 0'],
                        columns = ['Predict 1', 'Predict 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cmKNN, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y_test, y_test_pred_knn, labels = [1,0]))
tempresultsdf = pd.DataFrame({'Technique w/o scaling': ['KNN'], 'Accuracy (%)': [metrics.accuracy_score(y_test, y_test_pred_knn)*100] })
resultsdf = pd.concat([resultsdf, tempresultsdf])
resultsdf = resultsdf[['Technique w/o scaling', 'Accuracy (%)']]
resultsdf
from sklearn.svm import SVC
# Building a support vector Machine on train data
svc_model = SVC(C = 0.1, kernel = 'linear', gamma = 1) # gamma has no effect with a linear kernel
svc_fit = svc_model.fit(x_train, y_train)
y_train_pred_svc = svc_fit.predict(x_train)
print('Training Model Accuracy value: {0:0.2f}%'.format(accuracy_score(y_train, y_train_pred_svc)*100))
y_test_pred_svc = svc_fit.predict(x_test)
print('Testing Model Accuracy value: {0:0.2f}%'.format(accuracy_score(y_test, y_test_pred_svc)*100))
cmSVC = metrics.confusion_matrix(y_test, y_test_pred_svc, labels = [1,0])
df_cmSVC = pd.DataFrame(cmSVC, index = ['Actual 1', 'Actual 0'],
                        columns = ['Predict 1', 'Predict 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cmSVC, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y_test,y_test_pred_svc, labels = [0,1]))
tempresultsdf = pd.DataFrame({'Technique w/o scaling': ['SVM'], 'Accuracy (%)': [accuracy_score(y_test, y_test_pred_svc)*100] })
resultsdf = pd.concat([resultsdf, tempresultsdf])
resultsdf = resultsdf[['Technique w/o scaling', 'Accuracy (%)']]
resultsdf
from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(n_estimators = 50, random_state = 559, max_features = 12 )
rfcl = rfcl.fit(x_train, y_train)
y_predict_rfcl = rfcl.predict(x_test)
print(rfcl.score(x_train, y_train))
print(rfcl.score(x_test, y_test))
cm_rfcl = metrics.confusion_matrix(y_test, y_predict_rfcl, labels = [1,0])
df_cm_rfcl = pd.DataFrame(cm_rfcl, index = ['Actual 1', 'Actual 0'],
                          columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cm_rfcl, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y_test,y_predict_rfcl, labels = [0,1]))
tempresultsdf = pd.DataFrame({'Technique w/o scaling': ['Random Forest w/o Tuning'], 'Accuracy (%)': [rfcl.score(x_test, y_test)*100] })
resultsdf = pd.concat([resultsdf, tempresultsdf])
resultsdf = resultsdf[['Technique w/o scaling', 'Accuracy (%)']]
resultsdf
rfcl_tuned = RandomForestClassifier(criterion = 'entropy', class_weight={0:.5,1:.5}, max_depth = 5, min_samples_leaf=5)
rfcl_tuned = rfcl_tuned.fit(x_train, y_train)
test_pred_tuned_rfcl = rfcl_tuned.predict(x_test)
print(rfcl_tuned.score(x_train, y_train))
print(rfcl_tuned.score(x_test, y_test))
cm_rfcl_tuned = metrics.confusion_matrix(y_test, test_pred_tuned_rfcl, labels = [1,0])
df_cm_rfcl_tuned = pd.DataFrame(cm_rfcl_tuned, index = ['Actual 1', 'Actual 0'],
                                columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cm_rfcl_tuned, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
Compared to the model without tuning, tuning the hyperparameters eliminated the false negatives entirely and kept the model from overfitting.
print(metrics.classification_report(y_test,test_pred_tuned_rfcl, labels = [0,1]))
The f1-score and accuracy on the test set decreased, but the model no longer overfits and meets our requirement of minimizing false negatives.
tempresultsdf = pd.DataFrame({'Technique w/o scaling': ['Random Forest with Tuning'], 'Accuracy (%)': [rfcl_tuned.score(x_test, y_test)*100] })
resultsdf = pd.concat([resultsdf, tempresultsdf])
resultsdf = resultsdf[['Technique w/o scaling', 'Accuracy (%)']]
resultsdf
# Importing and Fitting the XGBoost to the Training Set
xgbc = xgboost.XGBClassifier()
xgbc.fit(x_train, y_train)
# Predicting the Test set results
y_pred_xgbc = xgbc.predict(x_test)
print(xgbc.score(x_train, y_train))
print(xgbc.score(x_test, y_test))
cm_xgbc = metrics.confusion_matrix(y_test, y_pred_xgbc, labels = [1,0])
df_cm_xgbc = pd.DataFrame(cm_xgbc, index = ['Actual 1', 'Actual 0'],
                          columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cm_xgbc, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y_test, y_pred_xgbc, labels = [0,1]))
tempresultsdf = pd.DataFrame({'Technique w/o scaling': ['XGBoost w/o Tuning'], 'Accuracy (%)': [xgbc.score(x_test, y_test)*100] })
resultsdf = pd.concat([resultsdf, tempresultsdf])
resultsdf = resultsdf[['Technique w/o scaling', 'Accuracy (%)']]
resultsdf
# 'learning_rate' - step size shrinkage applied at each boosting round
# 'max_depth' - maximum depth of each tree, limited to reduce overfitting
# 'min_child_weight' - minimum sum of instance weights required in a child node
# 'gamma' - minimum loss reduction required to make a split; larger values give simpler, more conservative models
# 'colsample_bytree' - fraction of columns sampled when constructing each tree
xgbc_tune = xgboost.XGBClassifier(booster = 'gbtree', colsample_bylevel = 1,
                                  colsample_bytree = 0.5, gamma = 0.4, learning_rate = 0.1,
                                  max_delta_step = 0, max_depth = 6, min_child_weight = 7, missing = np.nan,
                                  n_estimators = 100, random_state = 298)
xgbc_tune.fit(x_train, y_train)
# Predicting the Test set results
y_pred_xgbc_tune = xgbc_tune.predict(x_test)
print(xgbc_tune.score(x_train, y_train))
print(xgbc_tune.score(x_test, y_test))
cm_xgbc_tune = metrics.confusion_matrix(y_test, y_pred_xgbc_tune, labels = [1,0])
df_cm_xgbc_tune = pd.DataFrame(cm_xgbc_tune, index = ['Actual 1', 'Actual 0'],
                               columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cm_xgbc_tune, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y_test,y_pred_xgbc_tune, labels = [0,1]))
tempresultsdf = pd.DataFrame({'Technique w/o scaling': ['XGBoost with Tuning'], 'Accuracy (%)': [xgbc_tune.score(x_test, y_test)*100] })
resultsdf = pd.concat([resultsdf, tempresultsdf])
resultsdf = resultsdf[['Technique w/o scaling', 'Accuracy (%)']]
resultsdf
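The tuned hyperparameter values above were fixed by hand; they could also be found automatically with a grid search. A minimal `GridSearchCV` sketch, shown here with scikit-learn's `GradientBoostingClassifier` on synthetic data (the grid and data are illustrative; the same pattern applies unchanged to `XGBClassifier`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; in the notebook this would be x_train / y_train
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Small illustrative grid; a fuller XGBoost grid would also cover
# min_child_weight, gamma, and colsample_bytree
param_grid = {'learning_rate': [0.05, 0.1], 'max_depth': [3, 6]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3, scoring='accuracy')
search.fit(X, y)
print(search.best_params_)
```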
# importing necessary package to perform Stacking
from mlxtend.classifier import StackingClassifier
clf1 = BaggingClassifier(random_state = 39) # Sub-model 1: Bagging
clf2 = AdaBoostClassifier(random_state = 39) # Sub-model 2: AdaBoost
clf3 = RandomForestClassifier(random_state = 39) # Sub-model 3: Random Forest
clf4 = GaussianNB() # Sub-model 4: Gaussian Naive Bayes
clf5 = LGBMClassifier(random_state = 39) # Sub-model 5: LightGBM
mc = GradientBoostingClassifier(random_state = 39) # Meta-model: Gradient Boosting
# Assigning the sub-models and the meta-classifier to the Stacking Classifier
sclf = StackingClassifier(classifiers = [clf1, clf2, clf3, clf4, clf5], meta_classifier = mc)
# Loop over each sub-model and the stacked model, printing the cross-validated accuracy with its standard deviation
# Cross-validation is set to 10 folds, a common industry default
for clf, label in zip([clf1, clf2, clf3, clf4, clf5, sclf], ['BaggingClassifier', 'AdaBoostClassifier', 'RandomForestClassifier', 'GaussianNB', 'LGBMClassifier', 'StackingClassifier']):
    scores = model_selection.cross_val_score(clf, X, y, cv = 10, scoring = 'accuracy')
    print('Accuracy: %0.2f (+/- %0.2f) [%s]' % (scores.mean(), scores.std(), label))
clf11 = KNeighborsClassifier(n_neighbors=1)
clf22 = RandomForestClassifier(random_state=1)
clf33 = GaussianNB()
lr = LogisticRegression()
# Stacking on the sub-models' predicted class probabilities (use_probas=True) instead of their class labels
sclf_Prb = StackingClassifier(classifiers = [clf11,clf22,clf33], use_probas=True,
average_probas=False, meta_classifier = lr)
for clf, label in zip([clf11, clf22, clf33, sclf_Prb],
                      ['KNN', 'Random Forest', 'Naive Bayes', 'StackingClassifier']):
    scores = model_selection.cross_val_score(clf, X, y, cv = 3, scoring = 'accuracy')
    print('Accuracy: %0.2f (+/- %0.2f) [%s]' % (scores.mean(), scores.std(), label))
# Here we have considered Cross Validation value as 3.
fig = plt.figure(figsize = (16,5))
plt.title ('Accuracy values for various models/techniques without any scaling',y=1, size = 20)
sns.barplot(y = resultsdf['Accuracy (%)'], x = resultsdf['Technique w/o scaling'] );
plt.ylabel('Accuracy in %', size = 20)
plt.xlabel('Technique without scaling', size = 20)
plt.tight_layout()
df1 = df.copy() # copy so that scaling does not modify the original dataframe
df1.columns
from sklearn.preprocessing import StandardScaler
std_scale = StandardScaler()
std_scale
# Create a list of the numerical variables only
num_vars = [ 'MDVP_Fo_Hz', 'MDVP_Fhi_Hz', 'MDVP_Flo_Hz', 'MDVP_Jitter_per',
'MDVP_Jitter_Abs', 'MDVP_RAP', 'MDVP_PPQ', 'Jitter_DDP', 'MDVP_Shimmer',
'MDVP_Shimmer_dB', 'Shimmer_APQ3', 'Shimmer_APQ5', 'MDVP_APQ',
'Shimmer_DDA', 'NHR', 'HNR', 'RPDE', 'DFA', 'spread1',
'spread2', 'D2', 'PPE']
# Fit and transform the data
df1[num_vars] = std_scale.fit_transform(df1[num_vars])
print(df1.head(2))
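What `StandardScaler` does can be verified on a tiny example: each column is transformed to z = (x - mean) / std, so every scaled column ends up with mean 0 and unit standard deviation regardless of its original range. A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales
data = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0]])
scaled = StandardScaler().fit_transform(data)

# After scaling, both columns have mean ~0 and standard deviation ~1
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```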
from sklearn.model_selection import train_test_split
X1 = df1.drop('status', axis = 1) # predictor feature columns or independent columns (22 * m)
y1 = df1['status'] # target / predicted class (1 = Parkinson, 0 = Healthy) (1 * m)
y1.head()
x1_train, x1_test, y1_train, y1_test = train_test_split(X1,y1, test_size = 0.3, random_state = 537)
x1_train.head()
df1.describe().T
# Correlation of entire dataframe
corr_matrix = df1.corr()
# Features more related to Patient's Status
corr_matrix['status'].sort_values(ascending = False)
# Fit the model on train
model_lr = LogisticRegression(solver = 'liblinear')
model_lr.fit(x1_train, y1_train)
# Predict on test
y1_predict = model_lr.predict(x1_test)
# Checking the coefficient / slope for all columns and the intercept
coef_df = pd.DataFrame(model_lr.coef_)
coef_df['intercept'] = model_lr.intercept_
print(coef_df)
model_score_train = model_lr.score(x1_train, y1_train)
print('The accuracy of train model is {0:0.2f}%'.format(model_score_train*100))
model_score_test = model_lr.score(x1_test, y1_test)
print("The accuracy of Test model is {0:0.2f}% ".format(model_score_test*100))
cm = metrics.confusion_matrix(y1_test, y1_predict, labels = [1,0])
df_cm = pd.DataFrame(cm, index = ['Actual 1', 'Actual 0'],
                     columns = ['Predict 1', 'Predict 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (10,5))
sns.heatmap(df_cm, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red' );
print(metrics.classification_report(y1_test, y1_predict))
resultsdf_scale1 = pd.DataFrame({'Method_scaling(Std. Scalar)': ['Logistic Regression'], 'Accuracy(%)': [model_score_test*100]})
resultsdf_scale1 = resultsdf_scale1[['Method_scaling(Std. Scalar)', 'Accuracy(%)']]
resultsdf_scale1
from sklearn.naive_bayes import GaussianNB
model_GNB = GaussianNB()
model_GNB_fit = model_GNB.fit(x1_train, y1_train)
The GaussianNB classifier is now built and trained on the training data using the fit() method. With training complete, the model is ready to make predictions: the predict() method takes the test-set features as its argument.
y1_train_pred_GNB = model_GNB.predict(x1_train)
# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score
print('Training model Accuracy: {0:0.2f}%'.format(metrics.accuracy_score(y1_train, y1_train_pred_GNB)*100))
print()
y1_test_pred_GNB = model_GNB.predict(x1_test)
test_score_GNB = model_GNB.score(x1_test, y1_test)
# print('Testing Model Accuracy: {0:0.2f}%'.format(metrics.accuracy_score(y1_test, y1_test_pred_GNB)))
print('Testing Model Accuracy: {0:0.2f}%'.format(test_score_GNB*100))
cmGNB1 = metrics.confusion_matrix(y1_test, y1_test_pred_GNB, labels = [1,0])
df_cmGNB1 = pd.DataFrame(cmGNB1, index = ['Actual 1', 'Actual 0'],
                         columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cmGNB1, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y1_test, y1_test_pred_GNB, labels = [1,0]))
Here the f1-score for class 1 is above 70%, but the false-negative count remains very high.
tempresultsdf_scale1 = pd.DataFrame({'Method_scaling(Std. Scalar)': ['GNB'], 'Accuracy(%)': [test_score_GNB*100]})
resultsdf_scale1 = pd.concat([resultsdf_scale1, tempresultsdf_scale1])
resultsdf_scale1 = resultsdf_scale1[['Method_scaling(Std. Scalar)', 'Accuracy(%)']]
resultsdf_scale1
Build the model, then train and evaluate it on the training and testing sets respectively using scikit-learn, printing the model's accuracy for k = 3, 5, and 9.
# Learning the model (k=3)
knn3 = KNeighborsClassifier(n_neighbors = 3)
# Fitting the model
knn_fit3 = knn3.fit(x1_train, y1_train)
y1_train_pred_knn3 = knn_fit3.predict(x1_train)
print('Training Model Accuracy for k = 3: {0:0.2f}%'.format(metrics.accuracy_score(y1_train, y1_train_pred_knn3)*100))
y1_test_pred_knn3 = knn_fit3.predict(x1_test)
print('Testing Model Accuracy for k = 3: {0:0.2f}%'.format(metrics.accuracy_score(y1_test, y1_test_pred_knn3)*100))
print()
# Learning the model (k=5)
knn5 = KNeighborsClassifier(n_neighbors = 5)
# Fitting the model
knn_fit5 = knn5.fit(x1_train, y1_train)
y1_train_pred_knn5 = knn_fit5.predict(x1_train)
print('Training Model Accuracy for k = 5: {0:0.2f}%'.format(metrics.accuracy_score(y1_train, y1_train_pred_knn5)*100))
y1_test_pred_knn5 = knn_fit5.predict(x1_test)
print('Testing Model Accuracy for k = 5: {0:0.2f}%'.format(metrics.accuracy_score(y1_test, y1_test_pred_knn5)*100))
print()
# Learning the model (k=9)
knn9 = KNeighborsClassifier(n_neighbors = 9)
# Fitting the model
knn_fit9 = knn9.fit(x1_train, y1_train)
y1_train_pred_knn9 = knn_fit9.predict(x1_train)
print('Training Model Accuracy for k = 9: {0:0.2f}%'.format(metrics.accuracy_score(y1_train, y1_train_pred_knn9)*100))
y1_test_pred_knn9 = knn_fit9.predict(x1_test)
print('Testing Model Accuracy for k = 9: {0:0.2f}%'.format(metrics.accuracy_score(y1_test, y1_test_pred_knn9)*100))
print()
# Candidate k values for KNN: the odd numbers from 1 to 19
neighbors = list(range(1, 20, 2))
# Creating an empty list to hold accuracy scores
ac_scores = []
# Performing accuracy metrics for values from 1,3,5,7...19
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(x1_train, y1_train)
    # Predict the response
    y1_predKNN = knn.predict(x1_test)
    # Evaluate accuracy
    scores = accuracy_score(y1_test, y1_predKNN)
    ac_scores.append(scores)
# Converting accuracy to misclassification error
MSE = [1-x for x in ac_scores]
# Determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print('The Optimal number of neighbors is %d'% optimal_k)
# Misclassification error Vs. K
plt.plot(neighbors, MSE, color = 'red')
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
# Using k = 1 for the final model for prediction
knn = KNeighborsClassifier(n_neighbors = 1)
# Fitting the model
knn_fit = knn.fit(x1_train, y1_train)
y1_train_pred_knn = knn_fit.predict(x1_train)
print('Training Model Accuracy value for k=1: {0:0.2f}%'.format(metrics.accuracy_score(y1_train, y1_train_pred_knn)*100))
print('Training Model Recall value for k=1: {0:0.2f}%'.format(metrics.recall_score(y1_train, y1_train_pred_knn)*100))
print()
y1_test_pred_knn = knn_fit.predict(x1_test)
print('Testing Model Accuracy value for k=1: {0:0.2f}%'.format(metrics.accuracy_score(y1_test, y1_test_pred_knn)*100))
print('Testing Model Recall value for k=1: {0:0.2f}%'.format(metrics.recall_score(y1_test, y1_test_pred_knn)*100))
cmKNN1 = metrics.confusion_matrix(y1_test, y1_test_pred_knn, labels = [1,0])
df_cmKNN1 = pd.DataFrame(cmKNN1, index = ['Actual 1', 'Actual 0'],
                         columns = ['Predict 1', 'Predict 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cmKNN1, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y1_test, y1_test_pred_knn, labels = [1,0]))
tempresultsdf_scale1 = pd.DataFrame({'Method_scaling(Std. Scalar)': ['KNN'], 'Accuracy(%)': [metrics.accuracy_score(y1_test, y1_test_pred_knn)*100]})
resultsdf_scale1 = pd.concat([resultsdf_scale1, tempresultsdf_scale1])
resultsdf_scale1 = resultsdf_scale1[['Method_scaling(Std. Scalar)', 'Accuracy(%)']]
resultsdf_scale1
# Building a support vector Machine on train data
svc_model = SVC(C = 0.1, kernel = 'linear', gamma = 1) # gamma has no effect with a linear kernel
svc_fit = svc_model.fit(x1_train, y1_train)
y1_train_pred_svc = svc_fit.predict(x1_train)
print('Training Model Accuracy value: {0:0.2f}%'.format(accuracy_score(y1_train, y1_train_pred_svc)*100))
y1_test_pred_svc = svc_fit.predict(x1_test)
print('Testing Model Accuracy value: {0:0.2f}%'.format(accuracy_score(y1_test, y1_test_pred_svc)*100))
cmSVC1= metrics.confusion_matrix(y1_test, y1_test_pred_svc, labels = [1,0])
df_cmSVC1 = pd.DataFrame(cmSVC1, index = ['Actual 1', 'Actual 0'],
                         columns = ['Predict 1', 'Predict 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cmSVC1, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y1_test,y1_test_pred_svc, labels = [0,1]))
tempresultsdf_scale1 = pd.DataFrame({'Method_scaling(Std. Scalar)': ['SVM'], 'Accuracy(%)': [accuracy_score(y1_test, y1_test_pred_svc)*100]})
resultsdf_scale1 = pd.concat([resultsdf_scale1, tempresultsdf_scale1])
resultsdf_scale1 = resultsdf_scale1[['Method_scaling(Std. Scalar)', 'Accuracy(%)']]
resultsdf_scale1
rfcl = RandomForestClassifier(n_estimators = 50, random_state = 559, max_features = 12 )
rfcl = rfcl.fit(x1_train, y1_train)
y1_predict_rfcl = rfcl.predict(x1_test)
print(rfcl.score(x1_train, y1_train))
print(rfcl.score(x1_test, y1_test))
cm_rfcl1 = metrics.confusion_matrix(y1_test, y1_predict_rfcl, labels = [1,0])
df_cm_rfcl1 = pd.DataFrame(cm_rfcl1, index = ['Actual 1', 'Actual 0'],
                           columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cm_rfcl1, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y1_test,y1_predict_rfcl, labels = [0,1]))
tempresultsdf_scale1 = pd.DataFrame({'Method_scaling(Std. Scalar)': ['Random Forest w/o tuning'], 'Accuracy(%)': [rfcl.score(x1_test, y1_test)*100]})
resultsdf_scale1 = pd.concat([resultsdf_scale1, tempresultsdf_scale1])
resultsdf_scale1 = resultsdf_scale1[['Method_scaling(Std. Scalar)', 'Accuracy(%)']]
resultsdf_scale1
rfcl_tuned = RandomForestClassifier(n_estimators = 50, random_state = 559, criterion = 'entropy', class_weight={0:.5,1:.5}, max_depth = 4, min_samples_leaf=4)
rfcl_tuned = rfcl_tuned.fit(x1_train, y1_train)
test_pred_tuned_rfcl1 = rfcl_tuned.predict(x1_test)
print(rfcl_tuned.score(x1_train, y1_train))
print(rfcl_tuned.score(x1_test, y1_test))
cm_rfcl_tuned1 = metrics.confusion_matrix(y1_test, test_pred_tuned_rfcl1, labels = [1,0])
df_cm_rfcl_tuned1 = pd.DataFrame(cm_rfcl_tuned1, index = ['Actual 1', 'Actual 0'],
                                 columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cm_rfcl_tuned1, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
Compared to the model without tuning, tuning the hyperparameters eliminated the false negatives entirely and kept the model from overfitting.
print(metrics.classification_report(y1_test,test_pred_tuned_rfcl1, labels = [0,1]))
The f1-score and accuracy on the test set decreased, but the model no longer overfits and meets our requirement of minimizing false negatives.
tempresultsdf_scale1 = pd.DataFrame({'Method_scaling(Std. Scalar)': ['Random Forest with tuning'], 'Accuracy(%)': [rfcl_tuned.score(x1_test, y1_test)*100]})
resultsdf_scale1 = pd.concat([resultsdf_scale1, tempresultsdf_scale1])
resultsdf_scale1 = resultsdf_scale1[['Method_scaling(Std. Scalar)', 'Accuracy(%)']]
resultsdf_scale1
# calling and initializing the XGBoost classifier
xgbc = xgboost.XGBClassifier()
xgbc.fit(x1_train, y1_train)
# Predicting the Test set results
y1_pred_xgbc = xgbc.predict(x1_test)
print(xgbc.score(x1_train, y1_train))
print(xgbc.score(x1_test, y1_test))
cm_xgbc1 = metrics.confusion_matrix(y1_test, y1_pred_xgbc, labels = [1,0])
df_cm_xgbc1 = pd.DataFrame(cm_xgbc1, index = ['Actual 1', 'Actual 0'],
                           columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cm_xgbc1, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y1_test,y1_pred_xgbc, labels = [0,1]))
tempresultsdf_scale1 = pd.DataFrame({'Method_scaling(Std. Scalar)': ['XGBoost w/o tuning'], 'Accuracy(%)': [xgbc.score(x1_test, y1_test)*100]})
resultsdf_scale1 = pd.concat([resultsdf_scale1, tempresultsdf_scale1])
resultsdf_scale1 = resultsdf_scale1[['Method_scaling(Std. Scalar)', 'Accuracy(%)']]
resultsdf_scale1
# 'learning_rate' - step size shrinkage applied at each boosting round
# 'max_depth' - maximum depth of each tree, limited to reduce overfitting
# 'min_child_weight' - minimum sum of instance weights required in a child node
# 'gamma' - minimum loss reduction required to make a split; larger values give simpler, more conservative models
# 'colsample_bytree' - fraction of columns sampled when constructing each tree
xgbc_tune1 = xgboost.XGBClassifier(booster = 'gbtree', colsample_bylevel = 1,
                                   colsample_bytree = 0.5, gamma = 0.4, learning_rate = 0.1,
                                   max_delta_step = 0, max_depth = 6, min_child_weight = 7, missing = np.nan,
                                   n_estimators = 100, random_state = 298)
xgbc_tune1.fit(x1_train, y1_train)
# Predicting the Test set results
y1_pred_xgbc_tune = xgbc_tune1.predict(x1_test)
print(xgbc_tune1.score(x1_train, y1_train))
print(xgbc_tune1.score(x1_test, y1_test))
cm_xgbc_tune1 = metrics.confusion_matrix(y1_test, y1_pred_xgbc_tune, labels = [1,0])
df_cm_xgbc_tune1 = pd.DataFrame(cm_xgbc_tune1, index = ['Actual 1', 'Actual 0'],
                                columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cm_xgbc_tune1, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y1_test,y1_pred_xgbc_tune, labels = [0,1]))
tempresultsdf_scale1 = pd.DataFrame({'Method_scaling(Std. Scalar)': ['XGBoost with tuning'], 'Accuracy(%)': [xgbc_tune1.score(x1_test, y1_test)*100]})
resultsdf_scale1 = pd.concat([resultsdf_scale1, tempresultsdf_scale1])
resultsdf_scale1 = resultsdf_scale1[['Method_scaling(Std. Scalar)', 'Accuracy(%)']]
resultsdf_scale1
# assigning various techniques / models to be used in stacking
clf1 = BaggingClassifier(random_state = 39)
clf2 = AdaBoostClassifier(random_state = 39)
clf3 = RandomForestClassifier(random_state = 39)
clf4 = GaussianNB()
clf5 = LGBMClassifier(random_state = 39)
# building a meta-model for stacking
mc = GradientBoostingClassifier(random_state = 39)
# assigning the stacking model to perform
sclf = StackingClassifier(classifiers = [clf1,clf2,clf3,clf4,clf5], meta_classifier = mc)
# checking for accuracy for both meta model and normal model
for clf, label in zip([clf1, clf2, clf3, clf4, clf5, sclf], ['BaggingClassifier', 'AdaBoostClassifier', 'RandomForestClassifier', 'GaussianNB', 'LGBMClassifier', 'StackingClassifier']):
    scores = model_selection.cross_val_score(clf, X1, y1, cv = 10, scoring = 'accuracy')
    print('Accuracy: %0.2f (+/- %0.2f) [%s]' % (scores.mean(), scores.std(), label))
clf11 = KNeighborsClassifier(n_neighbors=1)
clf22 = RandomForestClassifier(random_state=1)
clf33 = GaussianNB()
lr = LogisticRegression()
sclf_Prb = StackingClassifier(classifiers = [clf11,clf22,clf33], use_probas=True,
average_probas=False, meta_classifier = lr)
for clf, label in zip([clf11, clf22, clf33, sclf_Prb],
                      ['KNN', 'Random Forest', 'Naive Bayes', 'StackingClassifier']):
    scores = model_selection.cross_val_score(clf, X1, y1, cv = 3, scoring = 'accuracy')
    print('Accuracy: %0.2f (+/- %0.2f) [%s]' % (scores.mean(), scores.std(), label))
resultsdf_scale1
fig = plt.figure(figsize = (16,5))
plt.title ('Accuracy values for various models/techniques with Scaling (Standard Scalar)',y=1, size = 20)
sns.barplot(y = resultsdf_scale1['Accuracy(%)'], x = resultsdf_scale1['Method_scaling(Std. Scalar)'] );
plt.ylabel('Accuracy in %', size = 20)
plt.xlabel('Technique with scaling via Standard Scalar', size = 20)
plt.tight_layout()
df2 = df.copy() # copy so that scaling does not modify the original dataframe
df2.columns
from sklearn.preprocessing import MinMaxScaler
minmax_scale = MinMaxScaler()
minmax_scale
# Create a list of the numerical variables only
num_vars = [ 'MDVP_Fo_Hz', 'MDVP_Fhi_Hz', 'MDVP_Flo_Hz', 'MDVP_Jitter_per',
'MDVP_Jitter_Abs', 'MDVP_RAP', 'MDVP_PPQ', 'Jitter_DDP', 'MDVP_Shimmer',
'MDVP_Shimmer_dB', 'Shimmer_APQ3', 'Shimmer_APQ5', 'MDVP_APQ',
'Shimmer_DDA', 'NHR', 'HNR', 'RPDE', 'DFA', 'spread1',
'spread2', 'D2', 'PPE']
# Fit and transform the data
df2[num_vars] = minmax_scale.fit_transform(df2[num_vars])
print(df2.tail(2))
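As with the standardization step earlier, `MinMaxScaler`'s effect can be checked on a tiny example: each column is mapped to (x - min) / (max - min), squeezing every feature into the [0, 1] range. A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two columns on different scales
data = np.array([[1.0, 50.0],
                 [2.0, 100.0],
                 [4.0, 200.0]])
scaled = MinMaxScaler().fit_transform(data)

# After scaling, each column's minimum is 0 and maximum is 1
print(scaled.min(axis=0), scaled.max(axis=0))
```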
from sklearn.model_selection import train_test_split
X2 = df2.drop('status', axis = 1) # predictor feature columns or independent columns (22 * m)
y2 = df2['status'] # target / predicted class (1 = Parkinson, 0 = Healthy) (1 * m)
y2.head()
df2.describe().T
x2_train, x2_test, y2_train, y2_test = train_test_split(X2,y2, test_size = 0.3, random_state = 537)
x2_train.head()
# Correlation of entire dataframe
corr_matrix = df2.corr()
# Features more related to Patient's Status
corr_matrix['status'].sort_values(ascending = False)
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
# Fit the model on train
model_lr = LogisticRegression(solver = 'liblinear')
model_lr.fit(x2_train, y2_train)
# Predict on test
y2_predict = model_lr.predict(x2_test)
# Checking the coefficient / slope for all columns and the intercept
coef_df = pd.DataFrame(model_lr.coef_)
coef_df['intercept'] = model_lr.intercept_
print(coef_df)
model_score_train = model_lr.score(x2_train, y2_train)
print('The accuracy of train model is {0:0.2f}%'.format(model_score_train*100))
model_score_test = model_lr.score(x2_test, y2_test)
print("The accuracy of Test model is {0:0.2f}% ".format(model_score_test*100))
cm2 = metrics.confusion_matrix(y2_test, y2_predict, labels = [1,0])
df_cm2 = pd.DataFrame(cm2, index = ['Actual 1', 'Actual 0'],
                      columns = ['Predict 1', 'Predict 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (10,5))
sns.heatmap(df_cm2, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red' );
print(metrics.classification_report(y2_test, y2_predict))
resultsdf_minmax = pd.DataFrame({'Method_scaling(MinMax Scalar)': ['Logistic Regression'], 'Accuracy(%)': [model_score_test*100]})
resultsdf_minmax = resultsdf_minmax[['Method_scaling(MinMax Scalar)', 'Accuracy(%)']]
resultsdf_minmax
from sklearn.naive_bayes import GaussianNB
model_GNB = GaussianNB()
model_GNB_fit = model_GNB.fit(x2_train, y2_train)
The GaussianNB classifier is now built and trained on the training data using the fit() method. With training complete, the model is ready to make predictions: the predict() method takes the test-set features as its argument.
# To calculate the accuracy score of the model
y2_train_pred_GNB = model_GNB.predict(x2_train)
from sklearn.metrics import accuracy_score
print('Training model Accuracy: {0:0.2f}%'.format(metrics.accuracy_score(y2_train, y2_train_pred_GNB)*100))
print()
y2_test_pred_GNB = model_GNB.predict(x2_test)
print('Testing Model Accuracy: {0:0.2f}%'.format(metrics.accuracy_score(y2_test, y2_test_pred_GNB)*100))
cmGNB2 = metrics.confusion_matrix(y2_test, y2_test_pred_GNB, labels = [1,0])
df_cmGNB2 = pd.DataFrame(cmGNB2, index = ['Actual 1', 'Actual 0'],
columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cmGNB2, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y2_test, y2_test_pred_GNB, labels = [1,0]))
Here the f1-score is slightly above 70%; however, the false-negative (FN) count is very high.
tempresultsdf_minmax = pd.DataFrame({'Method_scaling(MinMax Scalar)': ['GNB'], 'Accuracy(%)': [metrics.accuracy_score(y2_test, y2_test_pred_GNB)*100]})
resultsdf_minmax = pd.concat([resultsdf_minmax, tempresultsdf_minmax])
resultsdf_minmax = resultsdf_minmax[['Method_scaling(MinMax Scalar)', 'Accuracy(%)']]
resultsdf_minmax
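As noted above, GaussianNB fits one Gaussian per feature per class; the fitted parameters are exposed as theta_ (means) and, in scikit-learn 1.0+, var_ (variances; older releases call it sigma_). A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(7)
# One feature: class 0 centred at 0, class 1 centred at 3
X = np.concatenate([rng.normal(0.0, 1.0, size=(100, 1)),
                    rng.normal(3.0, 1.0, size=(100, 1))])
y = np.array([0] * 100 + [1] * 100)

gnb = GaussianNB().fit(X, y)
means = gnb.theta_  # per-class, per-feature fitted means
# Prediction picks the class whose Gaussian (times its prior) is most likely
pred = gnb.predict([[0.0], [3.0]])
```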
Build the model, then train and test it on the training and testing sets respectively using scikit-learn. Print the accuracy of the model for k = 3, 5, and 9.
# Loading Library
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
# Learning the model (k=3)
knn3 = KNeighborsClassifier(n_neighbors = 3)
# Fitting the model
knn_fit3 = knn3.fit(x2_train, y2_train)
y2_train_pred_knn3 = knn_fit3.predict(x2_train)
print('Training Model Accuracy for k = 3: {0:0.2f}%'.format(metrics.accuracy_score(y2_train, y2_train_pred_knn3)*100))
y2_test_pred_knn3 = knn_fit3.predict(x2_test)
print('Testing Model Accuracy for k = 3: {0:0.2f}%'.format(metrics.accuracy_score(y2_test, y2_test_pred_knn3)*100))
print()
# Learning the model (k=5)
knn5 = KNeighborsClassifier(n_neighbors = 5)
# Fitting the model
knn_fit5 = knn5.fit(x2_train, y2_train)
y2_train_pred_knn5 = knn_fit5.predict(x2_train)
print('Training Model Accuracy for k = 5: {0:0.2f}%'.format(metrics.accuracy_score(y2_train, y2_train_pred_knn5)*100))
y2_test_pred_knn5 = knn_fit5.predict(x2_test)
print('Testing Model Accuracy for k = 5: {0:0.2f}%'.format(metrics.accuracy_score(y2_test, y2_test_pred_knn5)*100))
print()
# Learning the model (k=9)
knn9 = KNeighborsClassifier(n_neighbors = 9)
# Fitting the model
knn_fit9 = knn9.fit(x2_train, y2_train)
y2_train_pred_knn9 = knn_fit9.predict(x2_train)
print('Training Model Accuracy for k = 9: {0:0.2f}%'.format(metrics.accuracy_score(y2_train, y2_train_pred_knn9)*100))
y2_test_pred_knn9 = knn_fit9.predict(x2_test)
print('Testing Model Accuracy for k = 9: {0:0.2f}%'.format(metrics.accuracy_score(y2_test, y2_test_pred_knn9)*100))
print()
# Creating odd list of k for KNN
mylist = list(range(1,20))
# Keeping just the odd ones
neighbors = list(filter(lambda x: x % 2 != 0, mylist))
# Creating an empty list to hold accuracy scores
ac_scores = []
# Computing accuracy for k = 1, 3, 5, 7, ..., 19
for k in neighbors:
knn = KNeighborsClassifier(n_neighbors = k)
knn.fit(x2_train, y2_train)
# predict the response
y2_predKNN = knn.predict(x2_test)
# Evaluate accuracy
scores = accuracy_score(y2_test, y2_predKNN)
ac_scores.append(scores)
# Converting to misclassification error
MSE = [1 - x for x in ac_scores]
# Determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print('The Optimal number of neighbors is %d'% optimal_k)
# Misclassification error Vs. K
plt.plot(neighbors, MSE, color = 'red')
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
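One caveat: choosing k by test-set accuracy, as above, lets test information leak into model selection. A safer variant is to pick k by cross-validation on the training data and touch the test set only once. A sketch on synthetic data (stand-ins for x2_train etc.):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins for the train/test split used above
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Search odd k on the training folds only
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': list(range(1, 20, 2))},
                    cv=5, scoring='accuracy')
grid.fit(X_tr, y_tr)
best_k = grid.best_params_['n_neighbors']
test_acc = grid.score(X_te, y_te)  # the test set is used exactly once
```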
# Using k = 5 for the final model for prediction
knn = KNeighborsClassifier(n_neighbors = 5)
# Fitting the model
knn_fit = knn.fit(x2_train, y2_train)
y2_train_pred_knn = knn_fit.predict(x2_train)
print('Training Model Accuracy value for k = 5: {0:0.2f}%'.format(metrics.accuracy_score(y2_train, y2_train_pred_knn)*100))
print('Training Model Recall value for k = 5: {0:0.2f}%'.format(metrics.recall_score(y2_train, y2_train_pred_knn)*100))
print()
y2_test_pred_knn = knn_fit.predict(x2_test)
print('Testing Model Accuracy value for k = 5: {0:0.2f}%'.format(metrics.accuracy_score(y2_test, y2_test_pred_knn)*100))
print('Testing Model Recall value for k = 5: {0:0.2f}%'.format(metrics.recall_score(y2_test, y2_test_pred_knn)*100))
cmKNN2 = metrics.confusion_matrix(y2_test, y2_test_pred_knn, labels = [1,0])
df_cmKNN2 = pd.DataFrame(cmKNN2, index = ['Actual 1', 'Actual 0'],
columns = ['Predict 1', 'Predict 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cmKNN2, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y2_test, y2_test_pred_knn, labels = [1,0]))
tempresultsdf_minmax = pd.DataFrame({'Method_scaling(MinMax Scalar)': ['KNN'], 'Accuracy(%)': [metrics.accuracy_score(y2_test, y2_test_pred_knn)*100]})
resultsdf_minmax = pd.concat([resultsdf_minmax, tempresultsdf_minmax])
resultsdf_minmax = resultsdf_minmax[['Method_scaling(MinMax Scalar)', 'Accuracy(%)']]
resultsdf_minmax
from sklearn.svm import SVC
# Building a support vector Machine on train data
svc_model = SVC(C = 0.1, kernel = 'linear', gamma = 1)  # note: gamma has no effect with a linear kernel
svc_fit = svc_model.fit(x2_train, y2_train)
y2_train_pred_svc = svc_fit.predict(x2_train)
print('Training Model Accuracy value: {0:0.2f}%'.format(accuracy_score(y2_train, y2_train_pred_svc)*100))
y2_test_pred_svc = svc_fit.predict(x2_test)
print('Testing Model Accuracy value: {0:0.2f}%'.format(accuracy_score(y2_test, y2_test_pred_svc)*100))
cmSVC2 = metrics.confusion_matrix(y2_test, y2_test_pred_svc, labels = [1,0])
df_cmSVC2 = pd.DataFrame(cmSVC2, index = ['Actual 1', 'Actual 0'],
columns = ['Predict 1', 'Predict 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cmSVC2, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y2_test,y2_test_pred_svc, labels = [0,1]))
tempresultsdf_minmax = pd.DataFrame({'Method_scaling(MinMax Scalar)': ['SVM'], 'Accuracy(%)': [accuracy_score(y2_test, y2_test_pred_svc)*100]})
resultsdf_minmax = pd.concat([resultsdf_minmax, tempresultsdf_minmax])
resultsdf_minmax = resultsdf_minmax[['Method_scaling(MinMax Scalar)', 'Accuracy(%)']]
resultsdf_minmax
from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(n_estimators = 50, random_state = 559, max_features = 12 )
rfcl = rfcl.fit(x2_train, y2_train)
y2_predict_rfcl = rfcl.predict(x2_test)
print(rfcl.score(x2_train, y2_train))
print(rfcl.score(x2_test, y2_test))
cm_rfcl2 = metrics.confusion_matrix(y2_test, y2_predict_rfcl, labels = [1,0])
df_cm_rfcl2 = pd.DataFrame(cm_rfcl2, index = ['Actual 1', 'Actual 0'],
columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cm_rfcl2, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y2_test,y2_predict_rfcl, labels = [0,1]))
tempresultsdf_minmax = pd.DataFrame({'Method_scaling(MinMax Scalar)': ['Random Forest w/o tuning'], 'Accuracy(%)': [rfcl.score(x2_test, y2_test)*100]})
resultsdf_minmax = pd.concat([resultsdf_minmax, tempresultsdf_minmax])
resultsdf_minmax = resultsdf_minmax[['Method_scaling(MinMax Scalar)', 'Accuracy(%)']]
resultsdf_minmax
rfcl_tuned = RandomForestClassifier(n_estimators = 50, random_state = 559, criterion = 'entropy', class_weight={0:.5,1:.5}, max_depth = 4, min_samples_leaf=4)
rfcl_tuned = rfcl_tuned.fit(x2_train, y2_train)
test_pred_tuned_rfcl2 = rfcl_tuned.predict(x2_test)
print(rfcl_tuned.score(x2_train, y2_train))
print(rfcl_tuned.score(x2_test, y2_test))
cm_rfcl_tuned2 = metrics.confusion_matrix(y2_test, test_pred_tuned_rfcl2, labels = [1,0])
df_cm_rfcl_tuned2 = pd.DataFrame(cm_rfcl_tuned2, index = ['Actual 1', 'Actual 0'],
columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cm_rfcl_tuned2, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
Compared to the model without tuning, tuning the hyperparameters here eliminated false negatives entirely and kept the model from overfitting.
print(metrics.classification_report(y2_test,test_pred_tuned_rfcl2, labels = [0,1]))
The f1-score and accuracy on the test set decreased, but the model is free from overfitting and meets our requirement of minimising false negatives.
tempresultsdf_minmax = pd.DataFrame({'Method_scaling(MinMax Scalar)': ['Random Forest with tuning'], 'Accuracy(%)': [rfcl_tuned.score(x2_test, y2_test)*100]})
resultsdf_minmax = pd.concat([resultsdf_minmax, tempresultsdf_minmax])
resultsdf_minmax = resultsdf_minmax[['Method_scaling(MinMax Scalar)', 'Accuracy(%)']]
resultsdf_minmax
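The effect of max_depth and min_samples_leaf on overfitting can be seen directly in the train/test gap. A minimal sketch on deliberately noisy synthetic data (illustrative, not the Parkinson's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# flip_y adds label noise, so an unconstrained forest can memorise it
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

free = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
capped = RandomForestClassifier(n_estimators=50, random_state=0,
                                max_depth=4, min_samples_leaf=4).fit(X_tr, y_tr)

gap_free = free.score(X_tr, y_tr) - free.score(X_te, y_te)
gap_capped = capped.score(X_tr, y_tr) - capped.score(X_te, y_te)
# With these settings the capped forest typically shows the smaller gap
```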
xgbc = xgboost.XGBClassifier()
xgbc.fit(x2_train, y2_train)
# Predicting the Test set results
y2_pred_xgbc = xgbc.predict(x2_test)
print(xgbc.score(x2_train, y2_train))
print(xgbc.score(x2_test, y2_test))
cm_xgbc2 = metrics.confusion_matrix(y2_test, y2_pred_xgbc, labels = [1,0])
df_cm_xgbc2 = pd.DataFrame(cm_xgbc2, index = ['Actual 1', 'Actual 0'],
columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cm_xgbc2, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y2_test,y2_pred_xgbc, labels = [0,1]))
tempresultsdf_minmax = pd.DataFrame({'Method_scaling(MinMax Scalar)': ['XGBoost w/o tuning'], 'Accuracy(%)': [xgbc.score(x2_test, y2_test)*100]})
resultsdf_minmax = pd.concat([resultsdf_minmax, tempresultsdf_minmax])
resultsdf_minmax = resultsdf_minmax[['Method_scaling(MinMax Scalar)', 'Accuracy(%)']]
resultsdf_minmax
# 'learning_rate' - step-size shrinkage applied to each boosting update
# 'max_depth' - maximum depth of each tree, capped to reduce overfitting
# 'min_child_weight' - minimum sum of instance weights required in a child node
# 'gamma' - minimum loss reduction required to make a further split; larger values make the model more conservative
# 'colsample_bytree' - fraction of columns sampled when building each tree
xgbc_tune2 = xgboost.XGBClassifier(booster='gbtree', colsample_bylevel=1,
colsample_bytree=0.5, gamma=0.4, learning_rate=0.1,
max_delta_step=0, max_depth=5, min_child_weight=7,  # 'missing' left at its default (np.nan)
n_estimators=100, random_state=293)
xgbc_tune2.fit(x2_train, y2_train)
# Predicting the Test set results
y2_pred_xgbc_tune = xgbc_tune2.predict(x2_test)
print(xgbc_tune2.score(x2_train, y2_train))
print(xgbc_tune2.score(x2_test, y2_test))
cm_xgbc_tune2 = metrics.confusion_matrix(y2_test, y2_pred_xgbc_tune, labels = [1,0])
df_cm_xgbc_tune2 = pd.DataFrame(cm_xgbc_tune2, index = ['Actual 1', 'Actual 0'],
columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cm_xgbc_tune2, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y2_test,y2_pred_xgbc_tune, labels = [0,1]))
tempresultsdf_minmax = pd.DataFrame({'Method_scaling(MinMax Scalar)': ['XGBoost with tuning'], 'Accuracy(%)': [xgbc_tune2.score(x2_test, y2_test)*100]})
resultsdf_minmax = pd.concat([resultsdf_minmax, tempresultsdf_minmax])
resultsdf_minmax = resultsdf_minmax[['Method_scaling(MinMax Scalar)', 'Accuracy(%)']]
resultsdf_minmax
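The tuned values above would normally come from a search rather than being hand-picked. A hedged sketch using scikit-learn's GridSearchCV; GradientBoostingClassifier is used here only so the example is self-contained, and XGBClassifier accepts the same estimator interface:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=8, random_state=1)

param_grid = {
    'learning_rate': [0.05, 0.1],  # step-size shrinkage
    'max_depth': [3, 5],           # depth cap against overfitting
}
search = GridSearchCV(GradientBoostingClassifier(n_estimators=50, random_state=1),
                      param_grid, cv=3, scoring='accuracy')
search.fit(X, y)
best = search.best_params_
```

Swapping in `xgboost.XGBClassifier()` and the grid over gamma, min_child_weight, and colsample_bytree would reproduce the tuning step above.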
# assigning various techniques / models to be used in stacking
clf1 = BaggingClassifier(random_state = 39)
clf2 = AdaBoostClassifier(random_state = 39)
clf3 = RandomForestClassifier(random_state = 39)
clf4 = GaussianNB()
clf5 = LGBMClassifier(random_state = 39)
# making a meta-model for stacking
mc = GradientBoostingClassifier(random_state = 39)
# assigning the stacking model to perform
sclf = StackingClassifier(classifiers = [clf1,clf2,clf3,clf4,clf5], meta_classifier = mc)
# checking for accuracy for both meta model and normal model
for clf, label in zip([clf1,clf2,clf3,clf4,clf5,sclf],['BaggingClassifier','AdaBoostClassifier', 'RandomForestClassifier', 'GaussianNB', 'LGBMClassifier','StackingClassifier']):
scores = model_selection.cross_val_score(clf, X2, y2, cv = 10, scoring= 'accuracy')
print('Accuracy: %0.2f (+/- %0.2f) [%s]'% (scores.mean(), scores.std(), label))
clf11 = KNeighborsClassifier(n_neighbors=1)
clf22 = RandomForestClassifier(random_state=1)
clf33 = GaussianNB()
lr = LogisticRegression()
sclf_Prb = StackingClassifier(classifiers = [clf11,clf22,clf33], use_probas=True,
average_probas=False, meta_classifier = lr)
for clf, label in zip([clf11,clf22,clf33,sclf_Prb],
['KNN',
'Random Forest',
'Naive Bayes',
'StackingClassifier']):
scores = model_selection.cross_val_score(clf, X2, y2, cv = 3, scoring= 'accuracy')
print('Accuracy: %0.2f (+/- %0.2f) [%s]'% (scores.mean(), scores.std(), label))
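Since scikit-learn 0.22, a StackingClassifier is also available in sklearn.ensemble itself; unlike mlxtend's default, it builds the meta-features from out-of-fold predictions. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=10, random_state=39)

stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=50, random_state=39)),
                ('gnb', GaussianNB())],
    final_estimator=LogisticRegression(),
    cv=5)  # meta-features are out-of-fold predictions from 5-fold CV
scores = cross_val_score(stack, X, y, cv=3, scoring='accuracy')
```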
resultsdf_minmax
fig = plt.figure(figsize = (18,5))
plt.title ('Accuracy values for various models/techniques with Scaling (MinMax Scalar)',y=1, size = 20)
sns.barplot(y = resultsdf_minmax['Accuracy(%)'], x = resultsdf_minmax['Method_scaling(MinMax Scalar)'] );
plt.ylabel('Accuracy in %', size = 20)
plt.xlabel('Technique with scaling via MinMax Scalar', size = 20)
plt.tight_layout()
Here we will check the p-value and correlation of each independent column; columns with a correlation above 0.95 will be dropped, and, using the selected columns with MinMax Scaler normalisation, we will re-check the accuracy of all the models.
# Fit the model on train
model_lr = LogisticRegression(solver = 'liblinear')
model_lr.fit(x2_train, y2_train)
# Predict on test
y2_predict = model_lr.predict(x2_test)
model_score_train = model_lr.score(x2_train, y2_train)
print('The accuracy of train model is {0:0.2f}%'.format(model_score_train*100))
model_score_test = model_lr.score(x2_test, y2_test)
print("The accuracy of Test model is {0:0.2f}% ".format(model_score_test*100))
import statsmodels.api as sm  # imported here since sm is first used in this cell
logit1 = sm.Logit(y2_train, sm.add_constant(x2_train))
lg1 = logit1.fit()
lg1.summary()
The error above likely stems from duplicate columns, or from columns with a correlation of exactly 1.0 that make the design matrix singular: two features are perfectly correlated. To avoid this problem, we have to drop some columns before any further operation.
Column pairs such as MDVP_RAP and Jitter_DDP, and Shimmer_APQ3 and Shimmer_DDA, have a correlation of 1.0 with each other; hence we will drop one column from each pair.
x2_train = x2_train.drop(['MDVP_RAP', 'Shimmer_APQ3' ], axis =1)
#Fit the model on train
model_lr1 = LogisticRegression(solver = 'liblinear')
model_lr1.fit(x2_train, y2_train)
# Predict on test
x2_test = x2_test.drop(['MDVP_RAP', 'Shimmer_APQ3' ], axis =1)
y2_predict1 = model_lr1.predict(x2_test)
model_score_train1 = model_lr1.score(x2_train, y2_train)
print('The accuracy of train model is {0:0.2f}%'.format(model_score_train1*100))
model_score_test1 = model_lr1.score(x2_test, y2_test)
print("The accuracy of Test model is {0:0.2f}% ".format(model_score_test1*100))
logit1 = sm.Logit(y2_train, sm.add_constant(x2_train))
lg1 = logit1.fit()
lg1.summary()
# Ways of dropping variables:
# - significance of variables (p-values)
# - VIF (variance inflation factor)
# Computing VIF: VIF = 1 / (1 - R^2)
# Create a dataframe containing every feature and its VIF value
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif=pd.DataFrame()
vif['Features'] = x2_train.columns
vif['VIF'] = [variance_inflation_factor(x2_train.values,i) for i in range(x2_train.shape[1]) ]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
Rules of thumb:
a) A VIF of 1 indicates no correlation between that independent variable and the others.
b) A VIF between 1 and 5 indicates moderate correlation, not severe enough to warrant corrective measures.
c) A VIF greater than 5 indicates a critical level of multicollinearity, and the p-value should also be checked.
Thus, we will remove columns one by one where VIF > 5 and p > 0.05.
The column 'MDVP_Shimmer' has a VIF of 2321.77 and a p-value of 0.337, far above our thresholds of 5 and 0.05 respectively. Therefore, we will drop this column first and re-check the VIF values of the others.
x2_train = x2_train.drop(['MDVP_Shimmer'], axis = 1)
logit2 = sm.Logit(y2_train, sm.add_constant(x2_train))
lg2 = logit2.fit()
lg2.summary()
vif=pd.DataFrame()
vif['Features'] = x2_train.columns
vif['VIF'] = [variance_inflation_factor(x2_train.values,i) for i in range(x2_train.shape[1]) ]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
The column 'MDVP_Jitter_per' has a VIF of 395.50 and a p-value of 0.425, far above our thresholds of 5 and 0.05 respectively. Therefore, we will drop this column and re-check the VIF values of the others.
x2_train = x2_train.drop(['MDVP_Jitter_per'], axis = 1)
logit3 = sm.Logit(y2_train, sm.add_constant(x2_train))
lg3 = logit3.fit()
lg3.summary()
vif=pd.DataFrame()
vif['Features'] = x2_train.columns
vif['VIF'] = [variance_inflation_factor(x2_train.values,i) for i in range(x2_train.shape[1]) ]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
The column 'MDVP_Shimmer_dB' has a VIF of 198.10 and a p-value of 0.757, far above our thresholds of 5 and 0.05 respectively. Therefore, we will drop this column and re-check the VIF values of the others.
x2_train = x2_train.drop(['MDVP_Shimmer_dB'], axis = 1)
logit4 = sm.Logit(y2_train, sm.add_constant(x2_train))
lg4 = logit4.fit()
lg4.summary()
vif=pd.DataFrame()
vif['Features'] = x2_train.columns
vif['VIF'] = [variance_inflation_factor(x2_train.values,i) for i in range(x2_train.shape[1]) ]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
The column 'Jitter_DDP' has a VIF of 144.82, far above our threshold of 5, but a p-value of 0.004, well below our threshold of 0.05; therefore, we will retain it for now and move on to the next feature.
The column 'PPE' has a VIF of 120.88 and a p-value of 0.305, far above our thresholds of 5 and 0.05 respectively. Therefore, we will drop this column and re-check the VIF values of the others.
x2_train = x2_train.drop(['PPE'], axis = 1)
logit5 = sm.Logit(y2_train, sm.add_constant(x2_train))
lg5 = logit5.fit()
lg5.summary()
vif=pd.DataFrame()
vif['Features'] = x2_train.columns
vif['VIF'] = [variance_inflation_factor(x2_train.values,i) for i in range(x2_train.shape[1]) ]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
Although the p-value of 'Jitter_DDP' is 0.004 (<0.05), its VIF of 136.29 is still extreme, so we will delete the column.
x2_train = x2_train.drop(['Jitter_DDP'], axis = 1)
logit6 = sm.Logit(y2_train, sm.add_constant(x2_train))
lg6 = logit6.fit()
lg6.summary()
vif=pd.DataFrame()
vif['Features'] = x2_train.columns
vif['VIF'] = [variance_inflation_factor(x2_train.values,i) for i in range(x2_train.shape[1]) ]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
Considering the VIF of 67.44 (>5) and the p-value of 0.324 (>0.05), we will delete the column 'Shimmer_APQ5'.
x2_train = x2_train.drop(['Shimmer_APQ5'], axis = 1)
logit7 = sm.Logit(y2_train, sm.add_constant(x2_train))
lg7 = logit7.fit()
lg7.summary()
vif=pd.DataFrame()
vif['Features'] = x2_train.columns
vif['VIF'] = [variance_inflation_factor(x2_train.values,i) for i in range(x2_train.shape[1]) ]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
Considering the VIF of 53.94 (>5) and the p-value of 0.057 (>0.05), we will delete the column 'MDVP_Jitter_Abs'.
x2_train = x2_train.drop(['MDVP_Jitter_Abs'], axis = 1)
logit8 = sm.Logit(y2_train, sm.add_constant(x2_train))
lg8 = logit8.fit()
lg8.summary()
vif=pd.DataFrame()
vif['Features'] = x2_train.columns
vif['VIF'] = [variance_inflation_factor(x2_train.values,i) for i in range(x2_train.shape[1]) ]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
Although the p-value of 'spread1' is 0.008 (<0.05), its VIF of 28.95 (>5) is still high, so we will delete the column.
x2_train = x2_train.drop(['spread1'], axis = 1)
logit9 = sm.Logit(y2_train, sm.add_constant(x2_train))
lg9 = logit9.fit()
lg9.summary()
vif=pd.DataFrame()
vif['Features'] = x2_train.columns
vif['VIF'] = [variance_inflation_factor(x2_train.values,i) for i in range(x2_train.shape[1]) ]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
Considering the VIF of 18.44 (>5) and the p-value of 0.915 (>0.05), we will delete the column 'HNR'.
x2_train = x2_train.drop(['HNR'], axis = 1)
logit10 = sm.Logit(y2_train, sm.add_constant(x2_train))
lg10 = logit10.fit()
lg10.summary()
vif=pd.DataFrame()
vif['Features'] = x2_train.columns
vif['VIF'] = [variance_inflation_factor(x2_train.values,i) for i in range(x2_train.shape[1]) ]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
Although the p-value of 'D2' is 0.043 (<0.05), its VIF of 16.80 (>5) is still high, so we will delete the column.
x2_train = x2_train.drop(['D2'], axis = 1)
logit11 = sm.Logit(y2_train, sm.add_constant(x2_train))
lg11 = logit11.fit()
lg11.summary()
vif=pd.DataFrame()
vif['Features'] = x2_train.columns
vif['VIF'] = [variance_inflation_factor(x2_train.values,i) for i in range(x2_train.shape[1]) ]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
Considering the VIF of 15.24 (>5) and the p-value of 0.570 (>0.05), we will delete the column 'Shimmer_DDA'.
x2_train = x2_train.drop(['Shimmer_DDA'], axis = 1)
logit12 = sm.Logit(y2_train, sm.add_constant(x2_train))
lg12 = logit12.fit()
lg12.summary()
vif=pd.DataFrame()
vif['Features'] = x2_train.columns
vif['VIF'] = [variance_inflation_factor(x2_train.values,i) for i in range(x2_train.shape[1]) ]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
Considering the VIF of 12.54 (>5) and the p-value of 0.144 (>0.05), we will delete the column 'MDVP_PPQ'.
x2_train = x2_train.drop(['MDVP_PPQ'], axis = 1)
logit13 = sm.Logit(y2_train, sm.add_constant(x2_train))
lg13 = logit13.fit()
lg13.summary()
vif=pd.DataFrame()
vif['Features'] = x2_train.columns
vif['VIF'] = [variance_inflation_factor(x2_train.values,i) for i in range(x2_train.shape[1]) ]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
With a VIF of 11.77 (>5) but a p-value of 0.003 (<0.05) for the column 'spread2', we will retain all the remaining columns from here on and compute the accuracy of the models.
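The drop, refit, and recheck cycle above can also be automated. A minimal sketch that computes VIF = 1/(1 - R^2) with plain NumPy least squares and drops the worst column until every VIF falls below the threshold (a simplification: it ignores the p-value check used above):

```python
import numpy as np
import pandas as pd

def vif_series(X: pd.DataFrame) -> pd.Series:
    """VIF_i = 1 / (1 - R_i^2), from an OLS fit of column i on the rest."""
    out = {}
    for col in X.columns:
        y = X[col].to_numpy()
        Z = X.drop(columns=col).to_numpy()
        Z = np.column_stack([np.ones(len(Z)), Z])  # intercept term
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[col] = np.inf if r2 >= 1 else 1 / (1 - r2)
    return pd.Series(out)

def drop_high_vif(X: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Iteratively drop the single highest-VIF column until all VIF <= threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        v = vif_series(X)
        if v.max() <= threshold:
            break
        X = X.drop(columns=v.idxmax())
    return X
```

On x2_train this would reproduce the spirit of the manual elimination, though the exact columns kept may differ because the manual pass also consulted p-values.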
x2_test = x2_test.drop([ 'MDVP_Jitter_per','MDVP_Jitter_Abs','MDVP_PPQ', 'Jitter_DDP', 'MDVP_Shimmer','MDVP_Shimmer_dB',
'Shimmer_APQ5','Shimmer_DDA', 'HNR', 'spread1','D2', 'PPE'], axis =1)
#Fit the model on train
model_lr2 = LogisticRegression(solver = 'liblinear')
model_lr2.fit(x2_train, y2_train)
#predict on test
y2_predict2 = model_lr2.predict(x2_test)
model_score_train2 = model_lr2.score(x2_train, y2_train)
print('The accuracy of train model is {0:0.2f}%'.format(model_score_train2*100))
model_score_test2 = model_lr2.score(x2_test, y2_test)
print("The accuracy of Test model is {0:0.2f}% ".format(model_score_test2*100))
cm_VIF = metrics.confusion_matrix(y2_test, y2_predict2, labels = [1,0])
df_cm_VIF = pd.DataFrame(cm_VIF, index = ['Actual 1', 'Actual 0'],
columns = ['Predict 1', 'Predict 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (10,5))
sns.heatmap(df_cm_VIF, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red' );
print(metrics.classification_report(y2_test, y2_predict2))
resultsdf_minmax_VIF = pd.DataFrame({'Method_scaling(MinMax Scalar with VIF)': ['Logistic Regression'], 'Accuracy(%)': [model_score_test2*100]})
resultsdf_minmax_VIF = resultsdf_minmax_VIF[['Method_scaling(MinMax Scalar with VIF)', 'Accuracy(%)']]
resultsdf_minmax_VIF
from sklearn.naive_bayes import GaussianNB
model_GNB = GaussianNB()
model_GNB_fit = model_GNB.fit(x2_train, y2_train)
The GaussianNB classifier is now built and has been trained on the training data via the fit() method. With the classifier fitted, the model is ready to make predictions: the predict() method takes the test-set features as its argument.
# To calculate the accuracy score of the model
y2_train_pred_GNB_VIF = model_GNB.predict(x2_train)
from sklearn.metrics import accuracy_score
print('Training model Accuracy: {0:0.2f}%'.format(metrics.accuracy_score(y2_train, y2_train_pred_GNB_VIF)*100))
print()
y2_test_pred_GNB_VIF = model_GNB.predict(x2_test)
print('Testing Model Accuracy: {0:0.2f}%'.format(metrics.accuracy_score(y2_test, y2_test_pred_GNB_VIF)*100))
cmGNB2_VIF = metrics.confusion_matrix(y2_test, y2_test_pred_GNB_VIF, labels = [1,0])
df_cmGNB2_VIF = pd.DataFrame(cmGNB2_VIF, index = ['Actual 1', 'Actual 0'],
columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cmGNB2_VIF, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y2_test, y2_test_pred_GNB_VIF, labels = [1,0]))
Here the f1-score is slightly above 70%, and the FN count has dropped sharply compared with the earlier GNB model (a drastic decrease after selecting features via VIF).
tempresultsdf_minmax_VIF = pd.DataFrame({'Method_scaling(MinMax Scalar with VIF)': ['GNB'], 'Accuracy(%)': [metrics.accuracy_score(y2_test, y2_test_pred_GNB_VIF)*100]})
resultsdf_minmax_VIF = pd.concat([resultsdf_minmax_VIF, tempresultsdf_minmax_VIF])
resultsdf_minmax_VIF = resultsdf_minmax_VIF[['Method_scaling(MinMax Scalar with VIF)', 'Accuracy(%)']]
resultsdf_minmax_VIF
Build the model, then train and test it on the training and testing sets respectively using scikit-learn. Print the accuracy of the model for k = 3, 5, and 9.
# Loading Library
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
# Learning the model (k=3)
knn3 = KNeighborsClassifier(n_neighbors = 3)
# Fitting the model
knn_fit3 = knn3.fit(x2_train, y2_train)
y2_train_pred_knn3 = knn_fit3.predict(x2_train)
print('Training Model Accuracy for k = 3: {0:0.2f}%'.format(metrics.accuracy_score(y2_train, y2_train_pred_knn3)*100))
y2_test_pred_knn3 = knn_fit3.predict(x2_test)
print('Testing Model Accuracy for k = 3: {0:0.2f}%'.format(metrics.accuracy_score(y2_test, y2_test_pred_knn3)*100))
print()
# Learning the model (k=5)
knn5 = KNeighborsClassifier(n_neighbors = 5)
# Fitting the model
knn_fit5 = knn5.fit(x2_train, y2_train)
y2_train_pred_knn5 = knn_fit5.predict(x2_train)
print('Training Model Accuracy for k = 5: {0:0.2f}%'.format(metrics.accuracy_score(y2_train, y2_train_pred_knn5)*100))
y2_test_pred_knn5 = knn_fit5.predict(x2_test)
print('Testing Model Accuracy for k = 5: {0:0.2f}%'.format(metrics.accuracy_score(y2_test, y2_test_pred_knn5)*100))
print()
# Learning the model (k=9)
knn9 = KNeighborsClassifier(n_neighbors = 9)
# Fitting the model
knn_fit9 = knn9.fit(x2_train, y2_train)
y2_train_pred_knn9 = knn_fit9.predict(x2_train)
print('Training Model Accuracy for k = 9: {0:0.2f}%'.format(metrics.accuracy_score(y2_train, y2_train_pred_knn9)*100))
y2_test_pred_knn9 = knn_fit9.predict(x2_test)
print('Testing Model Accuracy for k = 9: {0:0.2f}%'.format(metrics.accuracy_score(y2_test, y2_test_pred_knn9)*100))
print()
# Creating odd list of k for KNN
mylist = list(range(1,20))
# Keeping just the odd ones
neighbors = list(filter(lambda x: x % 2 != 0, mylist))
# Creating an empty list to hold accuracy scores
ac_scores = []
# Computing accuracy for k = 1, 3, 5, 7, ..., 19
for k in neighbors:
knn = KNeighborsClassifier(n_neighbors = k)
knn.fit(x2_train, y2_train)
# predict the response
y2_predKNN = knn.predict(x2_test)
# Evaluate accuracy
scores = accuracy_score(y2_test, y2_predKNN)
ac_scores.append(scores)
# Converting to misclassification error
MSE = [1 - x for x in ac_scores]
# Determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print('The Optimal number of neighbors is %d'% optimal_k)
# Misclassification error Vs. K
plt.plot(neighbors, MSE, color = 'red')
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
# Using k = 1 for the final model for prediction
knn = KNeighborsClassifier(n_neighbors = 1)
# Fitting the model
knn_fit = knn.fit(x2_train, y2_train)
y2_train_pred_knn = knn_fit.predict(x2_train)
print('Training Model Accuracy value for k = 1: {0:0.2f}%'.format(metrics.accuracy_score(y2_train, y2_train_pred_knn)*100))
print('Training Model Recall value for k = 1: {0:0.2f}%'.format(metrics.recall_score(y2_train, y2_train_pred_knn)*100))
print()
y2_test_pred_knn = knn_fit.predict(x2_test)
print('Testing Model Accuracy value for k = 1: {0:0.2f}%'.format(metrics.accuracy_score(y2_test, y2_test_pred_knn)*100))
print('Testing Model Recall value for k = 1: {0:0.2f}%'.format(metrics.recall_score(y2_test, y2_test_pred_knn)*100))
cmKNN2 = metrics.confusion_matrix(y2_test, y2_test_pred_knn, labels = [1,0])
df_cmKNN2 = pd.DataFrame(cmKNN2, index = ['Actual 1', 'Actual 0'],
columns = ['Predict 1', 'Predict 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cmKNN2, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
The FN count has increased after selecting features based on VIF and p-values.
print(metrics.classification_report(y2_test, y2_test_pred_knn, labels = [1,0]))
tempresultsdf_minmax_VIF = pd.DataFrame({'Method_scaling(MinMax Scalar with VIF)': ['KNN'], 'Accuracy(%)': [metrics.accuracy_score(y2_test, y2_test_pred_knn)*100]})
resultsdf_minmax_VIF = pd.concat([resultsdf_minmax_VIF, tempresultsdf_minmax_VIF])
resultsdf_minmax_VIF = resultsdf_minmax_VIF[['Method_scaling(MinMax Scalar with VIF)', 'Accuracy(%)']]
resultsdf_minmax_VIF
from sklearn.svm import SVC
# Building a support vector Machine on train data
svc_model = SVC(C = 0.1, kernel = 'linear', gamma = 1)  # note: gamma has no effect with a linear kernel
svc_fit_VIF = svc_model.fit(x2_train, y2_train)
y2_train_pred_svc_VIF = svc_fit_VIF.predict(x2_train)
print('Training Model Accuracy value: {0:0.2f}%'.format(accuracy_score(y2_train, y2_train_pred_svc_VIF)*100))
y2_test_pred_svc_VIF = svc_fit_VIF.predict(x2_test)
print('Testing Model Accuracy value: {0:0.2f}%'.format(accuracy_score(y2_test, y2_test_pred_svc_VIF)*100))
cmSVC2_VIF = metrics.confusion_matrix(y2_test, y2_test_pred_svc_VIF, labels = [1,0])
df_cmSVC2_VIF = pd.DataFrame(cmSVC2_VIF, index = ['Actual 1', 'Actual 0'],
columns = ['Predict 1', 'Predict 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cmSVC2_VIF, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y2_test,y2_test_pred_svc_VIF, labels = [0,1]))
tempresultsdf_minmax_VIF = pd.DataFrame({'Method_scaling(MinMax Scalar with VIF)': ['SVM'], 'Accuracy(%)': [accuracy_score(y2_test, y2_test_pred_svc_VIF)*100]})
resultsdf_minmax_VIF = pd.concat([resultsdf_minmax_VIF, tempresultsdf_minmax_VIF])
resultsdf_minmax_VIF = resultsdf_minmax_VIF[['Method_scaling(MinMax Scalar with VIF)', 'Accuracy(%)']]
resultsdf_minmax_VIF
rfcl_VIF = RandomForestClassifier(n_estimators = 50, random_state = 559, max_features = 4)
rfcl_VIF = rfcl_VIF.fit(x2_train, y2_train)
y2_predict_rfcl_VIF = rfcl_VIF.predict(x2_test)
print(rfcl_VIF.score(x2_train, y2_train))
print(rfcl_VIF.score(x2_test, y2_test))
cm_rfcl2_VIF = metrics.confusion_matrix(y2_test, y2_predict_rfcl_VIF, labels = [1,0])
df_cm_rfcl2_VIF = pd.DataFrame(cm_rfcl2_VIF, index = ['Actual 1', 'Actual 0'],
columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cm_rfcl2_VIF, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y2_test,y2_predict_rfcl_VIF, labels = [0,1]))
tempresultsdf_minmax_VIF = pd.DataFrame({'Method_scaling(MinMax Scalar with VIF)': ['Random Forest w/o tuning'], 'Accuracy(%)': [rfcl_VIF.score(x2_test, y2_test)*100]})
resultsdf_minmax_VIF = pd.concat([resultsdf_minmax_VIF, tempresultsdf_minmax_VIF])
resultsdf_minmax_VIF = resultsdf_minmax_VIF[['Method_scaling(MinMax Scalar with VIF)', 'Accuracy(%)']]
resultsdf_minmax_VIF
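The temp-DataFrame/concat pattern used to grow the results table recurs after every model in this notebook. A small helper could wrap it; `append_result` below is a hypothetical refactor, not a function used elsewhere in this work, and the two accuracy values are placeholders.

```python
# Optional refactor sketch: one helper for the repeated results-table append.
import pandas as pd

def append_result(results, method, accuracy_pct):
    """Append one model's name and test accuracy to the running results table."""
    row = pd.DataFrame({'Method_scaling(MinMax Scalar with VIF)': [method],
                        'Accuracy(%)': [accuracy_pct]})
    return pd.concat([results, row], ignore_index=True)

results = pd.DataFrame(columns=['Method_scaling(MinMax Scalar with VIF)', 'Accuracy(%)'])
results = append_result(results, 'KNN', 88.1)   # placeholder accuracy
results = append_result(results, 'SVM', 76.3)   # placeholder accuracy
print(results)
```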
rfcl_tuned_VIF = RandomForestClassifier(n_estimators = 50, random_state = 559, criterion = 'entropy', class_weight={0:.5,1:.5}, max_depth = 4, min_samples_leaf=4)
rfcl_tuned_VIF = rfcl_tuned_VIF.fit(x2_train, y2_train)
test_pred_tuned_rfcl2_VIF = rfcl_tuned_VIF.predict(x2_test)
print(rfcl_tuned_VIF.score(x2_train, y2_train))
print(rfcl_tuned_VIF.score(x2_test, y2_test))
cm_rfcl_tuned2_VIF = metrics.confusion_matrix(y2_test, test_pred_tuned_rfcl2_VIF, labels = [1,0])
df_cm_rfcl_tuned2_VIF = pd.DataFrame(cm_rfcl_tuned2_VIF, index = ['Actual 1', 'Actual 0'],
columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cm_rfcl_tuned2_VIF, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
Compared to the VIF model without tuning, tuning the hyperparameters here decreased the FN value by only 1; the result is still not better than the model without VIF.
print(metrics.classification_report(y2_test,test_pred_tuned_rfcl2_VIF, labels = [0,1]))
The f1-score and accuracy on the test set decreased, but the model is free from overfitting and meets our requirement of minimising the False Negative count.
tempresultsdf_minmax_VIF = pd.DataFrame({'Method_scaling(MinMax Scalar with VIF)': ['Random Forest with tuning'], 'Accuracy(%)': [rfcl_tuned_VIF.score(x2_test, y2_test)*100]})
resultsdf_minmax_VIF = pd.concat([resultsdf_minmax_VIF, tempresultsdf_minmax_VIF])
resultsdf_minmax_VIF = resultsdf_minmax_VIF[['Method_scaling(MinMax Scalar with VIF)', 'Accuracy(%)']]
resultsdf_minmax_VIF
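After tuning, it can be useful to inspect which of the VIF-retained features the Random Forest actually relies on. The sketch below uses synthetic data and placeholder feature names; applied to `rfcl_tuned_VIF` with the real VIF-reduced columns, the same two lines would rank the vocal parameters.

```python
# Minimal sketch (placeholder feature names): ranking Random Forest
# feature importances, which sum to 1 across all features.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=150, n_features=5, random_state=559)
cols = ['f1', 'f2', 'f3', 'f4', 'f5']  # stand-ins for the VIF-reduced columns
rf = RandomForestClassifier(n_estimators=50, random_state=559).fit(X, y)

importances = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)
print(importances)
```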
xgbc_VIF = xgboost.XGBClassifier()
xgbc_VIF.fit(x2_train, y2_train)
# Predicting the Test set results
y2_pred_xgbc_VIF = xgbc_VIF.predict(x2_test)
print(xgbc_VIF.score(x2_train, y2_train))
print(xgbc_VIF.score(x2_test, y2_test))
cm_xgbc2_VIF = metrics.confusion_matrix(y2_test, y2_pred_xgbc_VIF, labels = [1,0])
df_cm_xgbc2_VIF = pd.DataFrame(cm_xgbc2_VIF, index = ['Actual 1', 'Actual 0'],
columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cm_xgbc2_VIF, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y2_test,y2_pred_xgbc_VIF, labels = [0,1]))
tempresultsdf_minmax_VIF = pd.DataFrame({'Method_scaling(MinMax Scalar with VIF)': ['XGBoost w/o tuning'], 'Accuracy(%)': [xgbc_VIF.score(x2_test, y2_test)*100]})
resultsdf_minmax_VIF = pd.concat([resultsdf_minmax_VIF, tempresultsdf_minmax_VIF])
resultsdf_minmax_VIF = resultsdf_minmax_VIF[['Method_scaling(MinMax Scalar with VIF)', 'Accuracy(%)']]
resultsdf_minmax_VIF
# 'learning_rate' - Model learning rate defined
# 'max_depth'- maximum depth of the tree we want to be, for minimizing overfitting of model
# 'min_child_weight'- minimum weight of the child required in our model
# 'gamma'- gamma value should be as low as possible to control the complexity of the model
# 'colsample_bytree'- Defining the subsample ratio of the columns for each column split
xgbc_tune2_VIF = xgboost.XGBClassifier(booster='gbtree', colsample_bylevel=1,
colsample_bytree=0.5, gamma=0.4, learning_rate=0.1,
max_delta_step=0, max_depth=5, min_child_weight=7, missing=np.nan,
n_estimators=100, random_state=293)
xgbc_tune2_VIF.fit(x2_train, y2_train)
# Predicting the Test set results
y2_pred_xgbc_tune_VIF = xgbc_tune2_VIF.predict(x2_test)
print(xgbc_tune2_VIF.score(x2_train, y2_train))
print(xgbc_tune2_VIF.score(x2_test, y2_test))
cm_xgbc_tune2_VIF = metrics.confusion_matrix(y2_test, y2_pred_xgbc_tune_VIF, labels = [1,0])
df_cm_xgbc_tune2_VIF = pd.DataFrame(cm_xgbc_tune2_VIF, index = ['Actual 1', 'Actual 0'],
columns = ['Predicted 1', 'Predicted 0'])
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (8,5))
sns.heatmap(df_cm_xgbc_tune2_VIF, annot = True, fmt = 'g',cmap = colormap,linecolor = 'red');
print(metrics.classification_report(y2_test,y2_pred_xgbc_tune_VIF, labels = [0,1]))
tempresultsdf_minmax_VIF = pd.DataFrame({'Method_scaling(MinMax Scalar with VIF)': ['XGBoost with tuning'], 'Accuracy(%)': [xgbc_tune2_VIF.score(x2_test, y2_test)*100]})
resultsdf_minmax_VIF = pd.concat([resultsdf_minmax_VIF, tempresultsdf_minmax_VIF])
resultsdf_minmax_VIF = resultsdf_minmax_VIF[['Method_scaling(MinMax Scalar with VIF)', 'Accuracy(%)']]
resultsdf_minmax_VIF
fig = plt.figure(figsize = (18,5))
plt.title ('Accuracy values for various models/techniques with Scaling (MinMax Scalar & VIF)',y=1, size = 20)
sns.barplot(y = resultsdf_minmax_VIF['Accuracy(%)'], x = resultsdf_minmax_VIF['Method_scaling(MinMax Scalar with VIF)'] );
plt.ylabel('Accuracy in %', size = 20)
plt.xlabel('Technique with scaling via MinMax Scalar combined with VIF', size = 20)
plt.tight_layout()
Comparing the XGBoost model with and without tuning, the test accuracy remained the same, but a slight increase in f1-score is observed along with a decrease in the FN value by 2. This accuracy and f1-score give us the best model after tuning, which is entirely free from overfitting.
It can also be observed that the maximum accuracy, around 89%, has been attained by Random Forest and XGBoost. However, the training accuracy for both Random Forest and XGBoost is 100%, which directly signifies overfitting of the model through inclusion of noise. To avoid this, we tuned these two models; after tuning, we obtain 84% and 89% accuracy for the Random Forest and XGBoost models respectively. Hence, after performing feature engineering with the help of the VIF score and p-value, the maximum accuracy is attained by the XGBoost model at 89%, followed by the KNN model and then Random Forest.
Here, we have tried to study the data pattern and the correlation among all the features with the help of various kinds of plots: univariate, bivariate and multivariate. Violin plots and box plots were drawn to check the skewness and spread of the data points for each feature, and a skewness matrix was also checked to quantify the degree of skewness. The box plots revealed the presence of outliers, and to nullify their effect the data was scaled using the Z-score. To check the correlation between the feature columns, a pair plot, correlation matrix and correlation heat map were drawn. All three clearly indicated strong correlation, i.e. multicollinearity, between many features; some features even appeared to be duplicates of one another. To treat this, VIF and the p-value were taken into consideration while calculating the accuracy and f1-score of the various models.
To compute the classification results, several methodologies were applied: standard classification methods (Logistic Regression, GNB, KNN and SVM), standard ensemble models (Random Forest and XGBoost) and a stacking methodology using sub-classifiers and a meta-classifier.
The classification results were computed via the confusion matrix, in terms of True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN). Based on the confusion matrix, the overall Accuracy, Recall / Sensitivity / TPR, Precision and F1-score were calculated as:
$$ Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$$
$$ Recall\,/\,Sensitivity\,/\,TPR = \frac{TP}{TP + FN}$$
$$ Precision = \frac{TP}{TP + FP}$$
$$ F1\text{-}score = \frac{2 \cdot Recall \cdot Precision}{Recall + Precision}$$
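As a quick sanity check, the four formulas can be evaluated directly on a small confusion matrix; the counts below (TP=30, TN=24, FP=2, FN=3) are made up for illustration only.

```python
# Worked example of the four confusion-matrix formulas on made-up counts.
TP, TN, FP, FN = 30, 24, 2, 3  # illustrative values, not results from this study

accuracy  = (TP + TN) / (TP + FP + TN + FN)
recall    = TP / (TP + FN)              # also sensitivity / TPR
precision = TP / (TP + FP)
f1        = 2 * recall * precision / (recall + precision)

print('accuracy={:.4f} recall={:.4f} precision={:.4f} f1={:.4f}'.format(
    accuracy, recall, precision, f1))
```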
To ensure the highest accuracy and f1-score, various scaling and feature-engineering methodologies were adopted, such as standardisation, normalisation, VIF and the p-value (statistical significance).
fig = plt.figure(figsize = (18,20))
plt.subplot(4,1,1)
plt.title ('Accuracy values for various models/techniques without any scaling',y=1, size = 22, color = 'red')
sns.barplot(y = resultsdf['Accuracy (%)'], x = resultsdf['Technique w/o scaling'], facecolor = (0,1,0,0), linewidth = 5, edgecolor = sns.color_palette ('dark', 9) );
plt.ylabel('Accuracy in %', size = 20)
plt.xlabel('Technique without scaling', size = 20)
plt.tight_layout()
plt.subplot(4,1,2)
plt.title ('Accuracy values for various models/techniques with Scaling (Standard Scalar)',y=1, size = 22, color = 'red')
sns.barplot(y = resultsdf_scale1['Accuracy(%)'], x = resultsdf_scale1['Method_scaling(Std. Scalar)'], facecolor = (0,1,0,0), linewidth = 5, edgecolor = sns.color_palette ('dark', 9) );
plt.ylabel('Accuracy in %', size = 20)
plt.xlabel('Technique with scaling via Standard Scalar', size = 20)
plt.tight_layout()
plt.subplot(4,1,3)
plt.title ('Accuracy values for various models/techniques with Scaling (MinMax Scalar)',y=1, size = 22, color = 'red')
sns.barplot(y = resultsdf_minmax['Accuracy(%)'], x = resultsdf_minmax['Method_scaling(MinMax Scalar)'], facecolor = (0,1,0,0), linewidth = 5, edgecolor = sns.color_palette ('dark', 9));
plt.ylabel('Accuracy in %', size = 20)
plt.xlabel('Technique with scaling via MinMax Scalar', size = 20)
plt.tight_layout()
plt.subplot(4,1,4)
plt.title ('Accuracy values for various models/techniques with Scaling (MinMax Scalar & VIF)',y=1, size = 22, color = 'red')
sns.barplot(y = resultsdf_minmax_VIF['Accuracy(%)'],x = resultsdf_minmax_VIF['Method_scaling(MinMax Scalar with VIF)'], facecolor = (0,1,0,0), linewidth = 5, edgecolor = sns.color_palette ('dark', 9) );
plt.ylabel('Accuracy in %', size = 20)
plt.xlabel('Technique with scaling via MinMax Scalar combined with VIF', size = 20);
From the above bar plots, the following interpretations were drawn:
For the case, where we have not performed any kind of scaling,
Accuracy:
- Random Forest (88%) > XGBoost & SVM (86%) > KNN (83%) > Logistic Regression (81%) > GNB (74%)
F1-score:
- XGBoost (0.96)> Random Forest (0.93) > SVM (0.92) > KNN (0.89) > Logistic Regression (0.88) > GNB (0.8)
FN value :
- SVM (0) < Random Forest (1) < XGBoost (2) < KNN & LR (3) < GNB (15)
For the case scaling with Standard Scalar,
Accuracy:
- KNN (93%) > Random Forest & SVM (89%) > Logistic Regression (88%)> XGBoost (86%) > GNB (71%)
F1-score:
- KNN (0.96) > Random Forest (0.94) & SVM (0.94) > Logistic Regression (0.92) > XGBoost (0.91) > GNB (0.77)
FN value:
- SVM & Random Forest (0) < XGBoost & KNN & LR (2) < GNB (16)
For the case scaling with MinMax Scalar,
Accuracy:
- KNN (91%) > Random Forest (89%) > Logistic Regression (86%) & XGBoost (86%) > SVM (76%) > GNB (71%)
F1-score:
- KNN (0.95) > Random Forest (0.94) > Logistic Regression (0.92) > XGBoost (0.91) > SVM (0.87) > GNB (0.77)
FN value:
- SVM & LR & Random Forest (0) < KNN (1) < XGBoost (2) < GNB (16)
For the case scaling with VIF & MinMax Scalar,
Accuracy:
- XGBoost (89%) > KNN (88%) > Random Forest (84%) & Logistic Regression (84%) > GNB (77%) > SVM (76%)
F1-score:
- XGBoost (0.94) > KNN (0.92) > Random Forest (0.91) & Logistic Regression (0.91) > SVM (0.87) > GNB (0.86)
FN value:
- SVM (0) < LR (1) & XGBoost (1) < Random Forest (2) & KNN (2) < GNB (6)
- Random Forest is the best model to predict Parkinson's Disease from patients' voice records when we do not scale the data.
- Support Vector Machine and Random Forest, followed by the KNN model, are the best models to predict Parkinson's Disease from patients' voice records when we scale the data with the Standard Scalar.
- Random Forest followed by KNN are the two best models to predict Parkinson's Disease from patients' voice records when we scale the data with the MinMax Scalar.
- XGBoost is the best model to predict Parkinson's Disease from patients' voice records when we scale the data with VIF and the MinMax Scalar.
Note:
- After applying VIF and the p-value to check the variance inflation of the features, we deleted a lot of columns, which in turn decreased the accuracy and f1-score of most of the models. Further improvements can be made to enhance the accuracy and f1-score while retaining the same FN value.
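The VIF screening mentioned above can be sketched with statsmodels' `variance_inflation_factor`; the three features below are synthetic placeholders (x2 is deliberately made collinear with x1), not the actual vocal-parameter columns used in this work.

```python
# Hedged sketch of VIF-based multicollinearity screening on synthetic data.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df_feat = pd.DataFrame({
    'x1': x1,
    'x2': 0.9 * x1 + rng.normal(scale=0.1, size=100),  # highly collinear with x1
    'x3': rng.normal(size=100),                        # independent feature
})

vif = pd.Series(
    [variance_inflation_factor(df_feat.values, i) for i in range(df_feat.shape[1])],
    index=df_feat.columns)
print(vif)  # columns with VIF above roughly 5-10 are candidates for removal
```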
Stacking - In this method, we used 10-fold and 3-fold cross-validation techniques to test the generalisation capability of the classification method. These two CV techniques first divide the entire set of 195 vocal instances into 10 and 3 disjoint subsets, i.e. roughly 19-20 and 65 records per subset respectively. In each validation step, one subset was selected for testing and the remaining subsets were used for training the classifiers; these steps were repeated until all 10 (or 3) subsets had been used for testing.
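The fold sizes described above follow directly from splitting 195 records: 195 is not divisible by 10, so a 10-fold split yields folds of 19 or 20 records. A minimal check with scikit-learn's KFold:

```python
# Sketch of the 10-fold split on 195 records described in the text.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(195).reshape(-1, 1)  # stand-in for the 195 voice records
fold_sizes = [len(test_idx) for _, test_idx in KFold(n_splits=10).split(X)]
print(fold_sizes)  # folds of 19 or 20 records, summing to 195
```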
The present work also has some limitations. The correlations between a number of vocal parameters limit the performance improvement of the classifiers. In addition, the relatively small size of the current voice dataset (195 records) limits further study and analysis of the vocal parameters and patterns, as well as the effectiveness of all the classifiers. The classification performance of some of the models or techniques still needs to be evaluated on much larger data and with more feature-engineering concepts.